Chapter 20

Gateway Programming Fundamentals




Chapter 19 laid the groundwork for practical CGI programming. Now it is time to focus on the essentials of gateway web development: how to use CGI environment variables and how to manipulate standard input to receive and process the client request. The goal, in broad terms, is to create a CGI program that builds a response and prefaces it with a necessary MIME header. This response is highly flexible; it can be HTML, another data type, or it might build another form for the client to fill out. Recall the thematic HyperText Transfer Protocol elements of openness and extensibility throughout the discussion.

Perl and the Bourne Shell are used to explain the fundamentals of environment variables, MIME types, and data-passing methods. I then present practical Perl and Bourne Shell scripts to illustrate these points.

Multipurpose Internet Mail Extensions (MIME) in the CGI Environment

The novice web developer's bane is the failure to pay attention to the strict MIME requirements that HTTP imposes on the client request-server response cycle.

When a client request arrives via a METHOD=GET or METHOD=POST (refer to Chapter 19 for introductory remarks on these methods) and a CGI program executes to fulfill the request, data of one form or another is written to standard out (stdout, which is the terminal screen if the program is run as a stand-alone program) and then sent by the server to the client. The very first print statement must output a string of this form:

Content-type: type/subtype <line feed> <line feed>

Perl uses \n as the line feed escape sequence, for example, and therefore must start output of plain text or HTML with this statement:

print "Content-type: text/html\n\n";

The type text refers to the standard set of printable characters; historically, the subtype plain is defined. On the Web, html is an additional subtype: plain text with HTML formatting tags added. Web clients can handle the formatting of HTML directly.

Note that the first \n escape causes the first line feed to go to line 2 of the output, and the second \n escape ensures a completely blank second line.

The next little "Hello, World!" Bourne Shell script demonstrates the MIME header requirements without the benefit of a Perl \n escape sequence. In the Bourne Shell, the echo statement is the brute force line-feed method:

#!/bin/sh
echo "Content-type: text/html"
echo
echo "<HTML>"
echo "<HEAD><TITLE>Hello</TITLE></HEAD>"
echo "<BODY>Hello, World!</BODY></HTML>"

Caution
If the second line of the CGI script's output is not completely blank, the server cannot separate the header from the body, and the client sees an error instead of the script's output. If the developer is confronted by code that is syntactically correct, runs on the command line, but dies swiftly and mysteriously in a Web environment, a malformed MIME header might be the culprit. See "Code Debugging," later in this chapter, for more details.

Consider the following equivalent two-line code in the Bourne Shell:

echo "Content-type: text/html"
echo

Again, the second line of output is blank.
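The header rule can be captured once in a small helper so that no script forgets the blank line. The following is a minimal sketch, not part of any server distribution; the function name cgi_header is an assumption made for illustration. It uses printf rather than echo because printf treats the \n escape the same way in every POSIX shell.

```sh
#!/bin/sh
# cgi_header TYPE: emit the MIME header line followed by the required blank line.
# (cgi_header is a hypothetical helper name, not a standard utility.)
cgi_header() {
    printf 'Content-type: %s\n\n' "${1:-text/html}"
}

cgi_header text/html
echo "<HTML><BODY>Hello, World!</BODY></HTML>"
```

Called with no argument, the helper falls back to text/html; any other type/subtype pair can be passed as the first argument.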

Tip
Be aware of standard Perl toolkits that web developers can take advantage of. The NCSA httpd server distribution includes the useful cgi-handlers.pl (see note), which includes the following html_header subroutine to ensure a proper MIME header:
#
# from the cgi-handlers.pl package
#
sub html_header {
     local($title) = @_;

     print "Content-type: text/html\n\n";
     print "<html><head>\n";
     print "<title>$title</title>\n";
     print "</head>\n<body>\n";
}
This handy subroutine accepts an argument that forms the title of the HTML response, outputs the required MIME header, inserts the title within the HTML <head> and </head> tags, and then outputs the HTML tag <body>. The body of the response follows.

Other important type/subtype pairs are worth mentioning: image/gif is decoded inline by all graphical Web clients; image/jpeg is not universally decoded inline. In UNIX, the important file ~/.mailcap (the ~/ prefix means that this file is in the user's home directory) is the map between MIME types and the external executable files that can handle the corresponding media type. Here is a sample ~/.mailcap file:

audio/*; showaudio %s
video/mpeg; mpeg_play %s
image/*; xv %s
application/x-dvi; xdvi %s

If the image/jpeg is not decodable inline by the client directly, for example, the .mailcap file is referenced. The line starting with image/* is found and the corresponding viewer, xv, is spawned with the file name as its argument. Note that external viewers spawn processes that are independent from the client Web session. After the Web session terminates, external viewers still may be active. Plugins, popularized by Netscape in 1996, are quite different animals than external viewers. Adobe, for example, makes external viewers (Acrobat Reader and Acrobat Exchange, for example) for Portable Document Format (PDF) files, and starting with its Amber (Version 3.x) product line, its viewer now is integrated with the Netscape browser as a plugin. Plugin viewers extend the browser's functionality; in the case of Amber, PDF files can be viewed inline (integrated in the browser window with an additional Adobe toolbar). Plugins are not defined in the ~/.mailcap file; they ship with their own idiosyncratic installation instructions.
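The .mailcap lookup just described can be sketched in a few lines of Bourne Shell. This is only an illustration of the matching rule, assuming simple two-field entries like those in the sample file; real clients perform the lookup internally and also handle comments, continuation lines, and extra fields. The function name find_viewer is hypothetical.

```sh
#!/bin/sh
# find_viewer TYPE FILE: print the viewer command of the first entry in FILE
# whose type pattern matches TYPE. (Hypothetical helper for illustration.)
find_viewer() {
    wanted=$1
    while IFS=';' read -r pattern viewer; do
        case $wanted in
            $pattern)                       # unquoted, so image/* acts as a wildcard
                printf '%s\n' "${viewer# }" # drop the space that follows the semicolon
                return 0 ;;
        esac
    done < "$2"
    return 1                                # no entry matched
}

# Try it against a copy of the sample file shown above.
sample=$(mktemp "${TMPDIR:-/tmp}/mailcap.XXXXXX")
cat > "$sample" <<'EOF'
audio/*; showaudio %s
video/mpeg; mpeg_play %s
image/*; xv %s
application/x-dvi; xdvi %s
EOF
find_viewer image/jpeg "$sample"
find_viewer text/richtext "$sample" || echo "no external viewer configured"
rm -f "$sample"
```

The first call prints the xv entry because image/jpeg matches the image/* pattern; the second finds no entry and falls through to the message.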

The CGI programmer must have a good understanding of the set of available environment variables (see note).

When the client sends a request and the gateway program executes, the CGI programmer has access to the full set of environmental variables. These variables fall into two broad categories:

The first type is independent of the client request and has the same value no matter what the request. These values are properties of the server that also are known as server metainformation.

The second type does depend on the client request. Most of these are client-specific, but some do depend on the server to which the request is being sent.

CGI programs sometimes rely on the contents of some of these variables to fulfill the client request. Other variables are not essential to logical processing but can be manipulated and echoed back to the user for cosmetic or informational reasons. Examples of both scenarios are given in this chapter. In bimodal.pl, the variable $ENV{'HTTP_USER_AGENT'} is queried to determine the interface type. After that, I illustrate environmental variables serving a useful purpose in a Perl-to-e-mail gateway.

Here are important examples of both types. You can look on-line for a discussion of the full range of environmental variables; the definitions that follow also come from the on-line NCSA documentation.

HTTP Information about the Server That Does Not Depend on the Client Request

AUTH_TYPE  If the server supports user authentication and the script is protected, this is the protocol-specific authentication method used to validate the user.

CONTENT_LENGTH  The length of data buffer sent by the client. The CGI script reads the input buffer and uses the CONTENT_LENGTH to cut off the data stream at the appropriate point.

CONTENT_TYPE  For queries that have attached information, such as HTTP POST and PUT, this is the content type of the data.

GATEWAY_INTERFACE  The server CGI type and revision level. Format: CGI/revision.

HTTP_USER_AGENT  The browser that the client is using to send the request. General format: software/version library/version.

PATH_INFO  As you saw in Chapter 19, extra path information can be communicated by the client in the following form:

METHOD=GET(POST) ACTION= http://machine/path/progname/extra-path-info

The extra information is sent as PATH_INFO.

PATH_TRANSLATED  The server translates the virtual path represented in PATH_INFO to a physical path.

QUERY_STRING  The information that follows the ? in the URL that referenced this script. This variable was introduced in Chapter 19 as a technique to pass data to the CGI program.

REMOTE_ADDR  The IP address of the client.

REMOTE_HOST  The client host name. If the server does not have this information, it should set REMOTE_ADDR and leave this unset.

REMOTE_IDENT  If the HTTP server supports RFC 931 identification, this variable is set to the remote user name retrieved from the server. Using this variable should be limited to writing to the log file only (be careful not to compromise unwittingly the privacy of the user).

Caution
It is very dangerous, for performance reasons, for the web server administrator to turn on RFC 931, also known as ident. Granted, developers and administrators often are curious about identifying users accessing the web site. Ident adds an extra preliminary chat step between client and server, however, and the user ID is identified only if the client machine is running an ident daemon. Empirically, this occurred on the EDGAR server for less than 10 percent of the accesses in July and August 1994. Worse, according to Rob McCool (formerly of NCSA Mosaic's development team, now at Netscape Communications Corporation), the use of ident on the server side can cause great headaches to clients hiding behind corporate firewalls. The preliminary conversation, in which the server queries the firewall in an attempt to identify the client, confuses and even might hang those clients. By way of anecdotal evidence, I have noticed during my reign as Web master at the NYU EDGAR development site that several large corporate clients did suffer inexplicable delays when my server's ident was on.

REMOTE_USER  If the server supports user authentication and the script is protected, this is the authenticated user name.

REQUEST_METHOD  The HTML form uses a METHOD=GET or a METHOD=POST; these two are the most likely ones that the CGI programs have to face.

SCRIPT_NAME  A virtual path to the script being executed, used for self-referencing URLs such as ISINDEX queries.

SERVER_NAME  The server's host name, DNS alias, or IP address.

SERVER_PORT  The port number to which the client request was sent. Recall that port 80 is the http standard.

SERVER_PROTOCOL  The protocol that the client request is using: HTTP/1.0 or the more recent HTTP/1.1. Format: protocol/revision.

SERVER_SOFTWARE  The name and version of the Web server. Format: name/version.

The test-cgi Bourne Shell script from NCSA displays some of these variables, as shown in Listing 20.1.


Listing 20.1. The NCSA test-cgi Bourne Shell script.
#!/bin/sh

echo Content-type: text/plain
echo
echo CGI/1.0 test script report:
echo
echo argc is $#. argv is "$*".
echo
echo SERVER_SOFTWARE = $SERVER_SOFTWARE
echo SERVER_NAME = $SERVER_NAME
echo GATEWAY_INTERFACE = $GATEWAY_INTERFACE
echo SERVER_PROTOCOL = $SERVER_PROTOCOL
echo SERVER_PORT = $SERVER_PORT
echo REQUEST_METHOD = $REQUEST_METHOD
echo HTTP_ACCEPT = $HTTP_ACCEPT
echo PATH_INFO = $PATH_INFO
echo PATH_TRANSLATED = $PATH_TRANSLATED
echo SCRIPT_NAME = $SCRIPT_NAME
echo QUERY_STRING = $QUERY_STRING
echo REMOTE_HOST = $REMOTE_HOST
echo REMOTE_ADDR = $REMOTE_ADDR
echo REMOTE_USER = $REMOTE_USER
echo CONTENT_TYPE = $CONTENT_TYPE
echo CONTENT_LENGTH = $CONTENT_LENGTH

Figure 20.1 shows the result of the test-cgi environmental variable report.

Figure 20.1 : Sample output from NCSA's test-cgi Bourne Shell script.
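Because REMOTE_HOST may be left unset when the server does not know the client's host name, a script that reports where a request came from should be prepared to fall back on REMOTE_ADDR. The following Bourne Shell fragment is a minimal sketch of that fallback; the function name client_name and the final "unknown" default are assumptions made for illustration.

```sh
#!/bin/sh
# client_name: print the best available identification of the client,
# preferring the host name and falling back to the IP address.
# (client_name is a hypothetical helper, not part of test-cgi.)
client_name() {
    if [ -n "${REMOTE_HOST:-}" ]; then
        printf '%s\n' "$REMOTE_HOST"
    elif [ -n "${REMOTE_ADDR:-}" ]; then
        printf '%s\n' "$REMOTE_ADDR"
    else
        printf 'unknown\n'      # neither variable set, e.g. when run from the command line
    fi
}

echo "Content-type: text/plain"
echo
echo "Request came from: $(client_name)"
```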

Server-side includes (SSIs) use special extensions to HTML tagging (see note). SSI files look like HTML; they use the HTML tagging conventions. They are not quite the same as regular HTML files, however. I mention them here because they make interesting use of a superset of CGI environmental variables. They aren't strictly part of CGI programming, because HTML document preparers can use them without interfacing with a gateway program.

The best way to understand SSI directives is to look at a simple example of the SSI tags, tools.shtml, as shown in Listing 20.2.


Listing 20.2. The tools.shtml code.
<title> Filing Retrieval Tools </title>

<A HREF="http://edgar.stern.nyu.edu/formco_array.html">
<h2> Company Search </a></h2>

<A HREF="http://edgar.stern.nyu.edu/formlynx.html">
<h2> Company and Filing Type Search </a></h2>

<A HREF="http://edgar.stern.nyu.edu/formonly.html">
<h2>Form ONLY! Lookup</A></h2>

<A HREF="http://edgar.stern.nyu.edu/form2date.html">
<h2>Form and Date Range Lookup </A></h2>

<A HREF="http://edgar.stern.nyu.edu/current.html">
<h2> Current Filing Analysis </a> </h2>

<A HREF="http://edgar.stern.nyu.edu/mutual.html">
<h2> Mutual Funds Retrieval </a></h2>

<A HREF="http://edgar.stern.nyu.edu/EDGAR.html">
<img src="http://edgar.stern.nyu.edu/icons/back.gif">
Return to Home Page</a>

This toolkit was last modified on <!--#echo var="LAST_MODIFIED" -->
<!--#include virtual="/mgtest/" file="included.html" -->

Note that the document in Listing 20.2 has the odd extension of shtml. This is because my server is configured to recognize shtml as a file containing SSI tags. When my server receives a request to show a file with SSI directives, it must parse the document into HTML; only then is it returned to the client. Thus, the parsing represents a performance hit that the client must suffer. The upside is that the included information is dropped in on-the-fly at request time. The web developer should note that the Web master must take the necessary steps beforehand to configure the server to understand SSIs (enabling them in selected directories and defining a magic extension such as *.shtml that alerts the server to expect the extension tags). It would be a poor idea to enable SSIs on all *.html files because the server would have to parse every *.html file served (a big performance hit).

What does the tools.shtml file do? Before the server returns this document to the client, it parses the SSI directives. There are two such directives in Listing 20.2. The first,

<!--#echo var="LAST_MODIFIED" -->

instructs the server to resolve the variable LAST_MODIFIED and echo it in place. The second,

<!--#include virtual="/mgtest/" file="included.html" -->

is a directive to the server to include the file included.html in the HTML output, and the virtual tag tells the server that the directory alias mgtest should be suffixed to the document root.
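To make the idea of "resolve the variable and echo it in place" concrete, here is a toy imitation of the echo directive written in Bourne Shell. It is emphatically not how any server implements SSIs; it is a sketch that assumes the replacement value contains no |, &, or backslash characters, and the function name expand_echo and the sample date string are invented for the example.

```sh
#!/bin/sh
# expand_echo NAME VALUE: copy stdin to stdout, replacing every
# <!--#echo var="NAME" --> directive with VALUE.
# (Hypothetical helper; a real server does this parsing internally.)
expand_echo() {
    sed "s|<!--#echo var=\"$1\" -->|$2|g"
}

printf '%s\n' 'This toolkit was last modified on <!--#echo var="LAST_MODIFIED" -->' |
    expand_echo LAST_MODIFIED 'SAMPLE-DATE'
```

The text outside the directive passes through untouched, which is why an SSI file can otherwise look like ordinary HTML.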

Figure 20.2 shows the client's view of tools.shtml after it is parsed by the server.

Figure 20.2 : The client requests tools.shtml; the server parses the server-side includes and returns HTML.

It is possible to include, at request time, other information, such as a file size (substitute #fsize for #include in Listing 20.2).

The following variables (not part of the core set of CGI environment variables) also are available to be displayed via the echo directive:

DATE_GMT  The current date using Greenwich Mean Time.

DATE_LOCAL  The current date using the local time zone.

DOCUMENT_NAME  The current file name.

DOCUMENT_URI  The virtual path to the document (starting from the server's document root).

LAST_MODIFIED  The last date and time that the current file was "touched." If you want to display the modification date of included.html, for example, the following directive would do the trick:

<!--#flastmod virtual="/mgtest/" file="included.html" -->

QUERY_STRING_UNESCAPED  The unescaped QUERY_STRING environment variable sent by the client.

Caution
Server-side includes can be very dangerous. If the Web master defines html as the SSI extension, every HTML file will be parsed prior to returning to the client, which is a huge performance hit. SSIs pose no special security risk (no more so than CGI scripts, as long as the site administrators are aware that non-traditional CGI directories now are launching CGI scripts), but you must consider their potential to drag down the site's performance before you use them.
Another (rather improbable) danger is the infinite loop. If I construct a file (let's call it loop.shtml), and somewhere in that file, include the line
<!--#include virtual="/mgtest/" file="loop.shtml" -->
the file loop.shtml is dropped in within loop.shtml, again and again, ad infinitum: a recursive loop.
The web developer should make an independent judgment when weighing the performance loss of SSIs against the utility of showing useful information such as the file modification.

Perl, C-Shell, Bourne Shell, and other UNIX command shells are all interpreted scripting languages. They generally start with

#!<path>/<binary-executable>

If there is uncertainty about where the interpreter (for example, Perl) resides, the following UNIX command will locate it:

which perl

Perl often is installed by the superuser in the /usr/local/bin directory. Thus, Perl programs at many installations start with

#!/usr/local/bin/perl

and shell programs usually start with

#!/bin/sh

Thereafter, the scripts are checked one line at a time by the interpreter for syntactic correctness. They run slower than compiled code (for example, C or C++), but if the underlying data is well organized, even multimegabyte datastores can be managed effectively.

Caution
The web developer must know how the Web site administrator has configured the server's capability to execute CGI scripts. Only a few directories are eligible to run CGI scripts; alternatively, the server might allow CGI programs to be in all the HTML directories. In other words, it is insufficient to turn on the execute bits in UNIX, check the syntax, and hope that the script runs. If a script is in an invalid location, the server might output an Authorization Failed message or, worse, it might die silently. Furthermore, the file extension often is critical. It is a common configuration of NCSA servers to recognize extensions of *.csh (C-Shell), *.pl (Perl), *.sh (Bourne Shell), and *.cgi (generic CGI scripts) as legitimate CGI scripts. Some servers (Netscape's, for example) default to allowing only *.cgi as an executable extension. This is another argument to (1) make friends with your system administrator, and (2) avoid oddball script file extensions.

In gateway programming, it is easy to envision the script returning simple lines of formatted output in response to a client's data request. The reader should keep in mind, however, that scripts just as easily can output valid HTML that the server will return to the client. A client therefore can go directly to the URL of a gateway program, which then executes and displays HTML on the client screen. This might be a form that posts data to yet another script (I demonstrate this technique in Chapter 21's discussion of the company-stock ticker application). Or, the script program is gathering important information about the client and outputs the appropriate HTML, as I show later in this chapter with the bimodal.pl example.

Although Perl or C generally are the languages of choice for a budding developer, some people might not have access to Perl or might find C difficult to learn.

To further demonstrate the basics of the various methods of sending and receiving data between the client and CGI program, I start with simple Bourne Shell examples. The Bourne Shell, sh, is available on all UNIX boxes (well… it should be!) and these examples easily are adaptable to almost any other environment that has a batch command-line processing language and/or a shell with environment variables.

From Client to Server to Gateway and Back

A developer needs to understand three areas in client-to-server-to-gateway communication:

How a client can send data
How the server can pass that data to the gateway program
How the gateway can send data back to the server and then back to the client

The two basic means for the client to send data through the server to the gateway program are via the URL and the message body (in a METHOD=POST form). It is much more common for the client to use METHOD=POST but it is important that the web developer be familiar with all the routes. Passing data via the URL sometimes is necessary (in ISINDEX keyword searches) and sometimes a good idea, perhaps even with METHOD=POST.

To send data via the message body, use a form with METHOD=POST. This passes the data to the gateway program via the program's stdin. The CONTENT_LENGTH environment variable is set to the number of characters being sent; the CONTENT_TYPE variable is set to application/x-www-form-urlencoded.
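Because the message body arrives without an end-of-file marker, a script should read exactly CONTENT_LENGTH characters and no more. In the Bourne Shell, one way to do this is with dd, as in the following sketch; the function name read_post_body is an assumption, and the demonstration feeds the function from a pipe rather than from a live server.

```sh
#!/bin/sh
# read_post_body LENGTH: copy exactly LENGTH bytes from stdin to stdout.
# bs=1 makes dd count single bytes, so it stops at the stated length
# even when more data (or no end-of-file) follows.
read_post_body() {
    dd bs=1 count="${1:-0}" 2>/dev/null
}

# Simulated request: 17 bytes of form data followed by bytes that must be ignored.
printf 'FIELD1=a&FIELD2=bTRAILING' | read_post_body 17
echo
```

Inside a real gateway script, the call would typically look like body=$(read_post_body "$CONTENT_LENGTH").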

Passing data via the URL has several variations:

A URL with ?[field]=[value]&[field]=[value]  such as
http://www.some.box/cgi-bin/name.pl?FirstName=Bill&SecondName=Elmer
is equivalent to the browser sending data to the server via a form and the METHOD=GET request, because the equal signs are unencoded. An encoded = sign is the character string %3D; the hexadecimal representation for the = character is 3D.
A URL with ?[data] with no displayable = characters  Even if there are encoded = characters (that is, %3D in the URL), the server treats this as an ISINDEX query. For example,
http://www.hydra.com/cgi-bin/sams/nothing.pl?20
http://www.hydra.com/cgi-bin/sams/nothing.pl?chapter%3D20
both are treated as ISINDEX queries. Recall that an ISINDEX query usually is a keyword search using a text engine such as WAIS or freeWAIS; the general form of this request follows:
http://machine/path/text-gateway-script.pl?keyword1+keyword2+keyword3+...
Note that the ISINDEX query passes data via the command line. Unlike other methods of passing data, ISINDEX data is not encoded by the server before it is passed to the gateway program. No special decoding is necessary. Note that the + character, separating the keywords, was not encoded into its hexadecimal equivalent of %2B.

Tip
Although it is possible to create an HTML file with the <isindex> tag, there is no point; it will do nothing because an ISINDEX query is self-referencing (it calls itself). In other words, an ISINDEX screen should be generated by the script that also includes the code to perform the query.

A URL with extra path data  With this method, immediately following the gateway program name, information is appended in the format of a data path:
http://www.hydra.com/cgi-bin/sams/nothing.pl/Bill/Elmer/
After the server finds the gateway program, it puts everything that follows into the PATH_INFO environment variable. With the preceding URL, PATH_INFO contains /Bill/Elmer/.
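The rule that separates a form-style query from an ISINDEX query (the presence of an unencoded = sign) is easy to express as a shell case statement. The sketch below mirrors the rule as stated above, not the code of any particular server; classify_query is a hypothetical name.

```sh
#!/bin/sh
# classify_query STRING: report how the text after the ? would be treated.
#   form    - contains an unencoded "=", so it looks like METHOD=GET form data
#   isindex - has data but no unencoded "=" (an encoded %3D does not count)
#   none    - nothing followed the ?
classify_query() {
    case $1 in
        '')   echo none ;;
        *=*)  echo form ;;
        *)    echo isindex ;;
    esac
}

classify_query 'FirstName=Bill&SecondName=Elmer'
classify_query '20'
classify_query 'chapter%3D20'
```

The three calls correspond to the three example URLs in the list above.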

Before the Server Passes the Data: Encoding

With the exception of ISINDEX, the data first is encoded by the server: spaces are changed to plus signs (+), certain keyboard characters are translated to their hexadecimal equivalent (represented as %[hex equivalent]) (for example, a ! becomes %21), and fields within forms are concatenated with &. As an example, if a form contains:

Field 1<INPUT NAME=FIELD1> Field 2<INPUT NAME=FIELD2>

and data such as 1 !@#$% and 2 ^&*()_+| are input for fields one and two, respectively, the server encodes the data into the following string:

FIELD1=1+%21@%23%24%25&FIELD2=2+%5E%26*%28%29_%2B%7C

Notice that

The fields are separated by the unencoded &.
Within each field, an unencoded = separates the field name from the data.
Spaces within the field data are translated to +.
Certain other keyboard characters are encoded, as mentioned, to %[hex equivalent].
The protocol designers decided to use readable characters only in the encoding scheme for clarity and ease of use; no high-end ASCII (unprintable) characters can appear.
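The encoding rules listed above can be reversed with a short Bourne Shell and awk combination. What follows is a sketch under two assumptions: the input is well formed (every % is followed by two hexadecimal digits), and the function names urldecode and parse_fields are invented for this example.

```sh
#!/bin/sh
# urldecode: read one encoded string per line on stdin, write it decoded.
urldecode() {
    awk '
    BEGIN { hex = "0123456789ABCDEF" }
    {
        gsub(/\+/, " ")                       # a plus sign stands for a space
        out = ""
        while ((i = index($0, "%")) > 0) {    # decode each %XX in turn
            hi = index(hex, toupper(substr($0, i + 1, 1))) - 1
            lo = index(hex, toupper(substr($0, i + 2, 1))) - 1
            out = out substr($0, 1, i - 1) sprintf("%c", hi * 16 + lo)
            $0 = substr($0, i + 3)
        }
        print out $0
    }'
}

# parse_fields STRING: split on the unencoded &, then on the first unencoded =,
# and print each field as "name: decoded value".
parse_fields() {
    printf '%s\n' "$1" | tr '&' '\n' |
    while IFS='=' read -r name value; do
        printf '%s: %s\n' "$name" "$(printf '%s\n' "$value" | urldecode)"
    done
}

parse_fields 'FIELD1=1+%21@%23%24%25&FIELD2=2+%5E%26*%28%29_%2B%7C'
```

The plus signs are converted before the %XX sequences so that an encoded plus (%2B) survives as a literal + instead of becoming a space.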

How the Server Passes the Data to the Gateway Program

After the server receives the data, it has three ways to send that data to the gateway program:

Via the gateway program's stdin.  If the REQUEST_METHOD is POST, the server first encodes the data as described previously and then sends it to the gateway via stdin. In a UNIX Shell, you can simulate this on the command line by creating a file with the data and running the script like this:

$ test-cgi.sh < test.data

It is important to note that there is no end-of-file terminating the data. The server sets the CONTENT_LENGTH variable to the number of characters in the data stream, and the script must include code to read only that amount of data from the stdin datastream. In Perl, the statement

read(stdin, $input_line, $ENV{CONTENT_LENGTH})

properly puts the stdin data into the variable $input_line.

Via the command line.  If the REQUEST_METHOD is GET and the server recognizes the incoming data as an ISINDEX query, the server passes the data on to the gateway program as command-line arguments without encoding the data. This is the same as running the script on the shell command line as the following:

$ test-cgi.sh arg1 arg2 arg3 . . .

Via the server's environment variables.  Recall the discussion of environment variables at the start of this chapter. Any variables set by the client also are passed along by the server to the gateway program. To test the script on the command line with environment variables, the variables first must be set. How this is done depends on the type of shell being used. In the Bourne Shell, for example,

$ QUERY_STRING=FNAME\=foo\&LNAME\=bar
$ export QUERY_STRING
$ echo $QUERY_STRING
FNAME=foo&LNAME=bar

sets the QUERY_STRING variable to FNAME=foo&LNAME=bar for testing with a script. Note that a user, when sending a browser to a URL of the form http://www.some.box/cgi-bin/test.pl?foo, is setting the QUERY_STRING variable to foo. Similarly, the URL http://www.some.box/cgi-bin/test.pl/foo sets the PATH_INFO variable to /foo. Often, the developer will test GET methods via a browser instead of operating on the command line.
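For repeated command-line testing, the variable settings can be wrapped in two small functions so that each test is a single line. This is a sketch only; the names simulate_get and simulate_post are hypothetical, and the functions set just the handful of variables discussed here, not the full environment a real server provides.

```sh
#!/bin/sh
# simulate_get QUERY COMMAND...: run COMMAND as if it had received a GET request.
simulate_get() {
    query=$1; shift
    REQUEST_METHOD=GET QUERY_STRING="$query" "$@"
}

# simulate_post BODY COMMAND...: run COMMAND as if BODY had been posted to it.
# The body is written to the command's stdin with no trailing newline.
simulate_post() {
    body=$1; shift
    printf '%s' "$body" |
        REQUEST_METHOD=POST CONTENT_LENGTH=${#body} \
        CONTENT_TYPE=application/x-www-form-urlencoded "$@"
}

# A stand-in "script" that just reports what it was given.
simulate_get 'FNAME=foo&LNAME=bar' sh -c 'echo "$REQUEST_METHOD: $QUERY_STRING"'
simulate_post 'FNAME=foo&LNAME=bar' sh -c 'echo "$REQUEST_METHOD: $CONTENT_LENGTH bytes on stdin"'
```

In practice the sh -c stand-in would be replaced by the path to the gateway script under test.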

Code Sample: The Print Everything Script

To aid the developer in understanding how data flows between the client, server, and gateway, Listing 20.3 shows a simple script, in both Bourne and Perl, for testing the various data-passing methods.


Listing 20.3. A Bourne Shell script to demonstrate GET and POST methods.
#!/bin/sh
echo "Content-type: text/html"
echo
progname=print-everything.sh
action=cgi-bin/bourne/$progname

if [ $# = 0 ]
then
echo "<HEAD><TITLE>The Print Everything Form</TITLE><ISINDEX></HEAD><BODY>"

echo "GET form:"
echo "<FORM METHOD=GET ACTION=/$action>"
echo "Field 1<INPUT NAME=FIELD1>"
echo "Field 2<INPUT NAME=FIELD2>"
echo "<INPUT TYPE=submit VALUE=SUBMIT>"
echo "</FORM>"
echo "POST form:"
echo "<FORM METHOD=POST ACTION=/$action>"
echo "Field 1<INPUT NAME=FIELD1>"
echo "Field 2<INPUT NAME=FIELD2>"
echo "<INPUT TYPE=submit VALUE=SUBMIT>"
echo "</FORM></BODY>"

case "$REQUEST_METHOD" in
GET)  echo "You made a GET Request<BR>" ;;

POST) read input_line
      echo "You made a POST Request passing:<BR>"
      echo " $input_line<BR>"
      echo "to <I>stdin</I><BR>" ;;
*)    echo "I don't understand the REQUEST_METHOD: $REQUEST_METHOD<BR>";;
esac

else
echo "<HEAD><TITLE>The Print Everything Form</TITLE><ISINDEX></HEAD><BODY>"
echo "GET form:"
echo "<FORM METHOD=GET ACTION=/$action>"
echo "Field 1<INPUT NAME=FIELD1>"
echo "Field 2<INPUT NAME=FIELD2>"
echo "<INPUT TYPE=submit VALUE=SUBMIT>"
echo "</FORM>"
echo "POST form:"
echo "<FORM METHOD=POST ACTION=/$action>"
echo "Field 1<INPUT NAME=FIELD1>"
echo "Field 2<INPUT NAME=FIELD2>"
echo "<INPUT TYPE=submit VALUE=SUBMIT>"
echo "</FORM></BODY>"
echo "This is an <B>ISINDEX</B> query:<BR>"
echo "and you input: $*"
fi

echo "<PRE>"
echo "REQUEST_METHOD:  $REQUEST_METHOD"
echo "Command line arguments:  $*"
echo "QUERY_STRING: $QUERY_STRING"
echo "PATH_INFO:    $PATH_INFO"
echo "</PRE>"
echo "<HR>"
echo "back to <A HREF=$progname>Print Everything</A><BR>"

Run this script, and the screen shown in Figure 20.3 appears.

Figure 20.3 : The input screen for the print_everything.sh script.

The reader might want to try this script with input such as the following:

In the browser's Document URL input field, follow these steps:

  1. Put extra path info after the URL.
  2. Put [field]=[data] after the URL.
  3. Put [data]%3D after the URL.
  4. Put [data] with either = or %3D after the URL, and put data into the GET or POST input fields and click Submit for that field.
  5. Add other environment variables to the output screen.

After trying different types of input or modifying the script, the developer should have a better feel for how the server looks at the incoming data.

To begin your transition to Perl, Listing 20.4 shows a version of print_everything in Perl.


Listing 20.4. The print_everything.pl code.
#!/usr/local/bin/perl
#
#  print_everything.pl
#
print "Content-type: text/html\n\n";
$progname = "print_everything.pl";
$action= "cgi-bin/bourne/$progname";
if(@ARGV == 0){
     print "<HEAD><TITLE>The Print Everything Form</TITLE><ISINDEX></HEAD><BODY>";

     print "GET form:";
     print "<FORM METHOD=GET ACTION=/$action>";
     print "Field 1<INPUT NAME=FIELD1>";
     print "Field 2<INPUT NAME=FIELD2>";
     print "<INPUT TYPE=submit VALUE=SUBMIT>";
     print "</FORM>" ;
     print "POST form:";
     print "<FORM METHOD=POST ACTION=/$action>";
     print "Field 1<INPUT NAME=FIELD1>";
     print "Field 2<INPUT NAME=FIELD2>";
     print "<INPUT TYPE=submit VALUE=SUBMIT>";
     print "</FORM></BODY>" ;

if($ENV{REQUEST_METHOD} eq "GET")
  {   # a GET request carries no message body, so there is nothing to read from stdin
      print "You made a GET Request<BR>";
  }

elsif($ENV{REQUEST_METHOD} eq "POST")
  {   read(stdin, $input_line, $ENV{CONTENT_LENGTH});
      print "You made a POST Request passing:<BR>";
      print " $input_line<BR>";
      print "to <I>stdin</I><BR>" ;
  }
else
  {   print "I don't understand the REQUEST_METHOD: $ENV{REQUEST_METHOD}<BR>";}
} # end argv if test
else {  # in case command-line argument(s) given
     print "<HEAD><TITLE>The Print Everything Form</TITLE><ISINDEX></HEAD><BODY>";
     print "GET form:";
     print "<FORM METHOD=GET ACTION=/$action>";
     print "Field 1<INPUT NAME=FIELD1>";
     print "Field 2<INPUT NAME=FIELD2>";
     print "<INPUT TYPE=submit VALUE=SUBMIT>";
     print "</FORM>" ;
     print "POST form:";
     print "<FORM METHOD=POST ACTION=/$action>";
     print "Field 1<INPUT NAME=FIELD1>";
     print "Field 2<INPUT NAME=FIELD2>";
     print "<INPUT TYPE=submit VALUE=SUBMIT>";
     print "</FORM></BODY>";
     print "This is an <B>ISINDEX</B> query:<BR>";
     print "and you input: @ARGV ";
}

print "<PRE>";
print "REQUEST_METHOD:  $ENV{REQUEST_METHOD}\n";
print "Command line arguments:  @ARGV\n";
print "QUERY_STRING: $ENV{QUERY_STRING}\n";
print "PATH_INFO:    $ENV{PATH_INFO}\n";
print "</PRE>\n";
print "<HR>";
print "back to <A HREF=$progname>Print Everything</A><BR>";
exit;

If the developer wants to see all the environmental variables, not just those related to the Web transaction, it is a simple matter in Perl, as Listing 20.5 shows. Figure 20.4 shows the output of this script.

Figure 20.4 : The output of the dump_vars script.


Listing 20.5. dump_vars: A short Perl program to list the environmental variables.
#!/usr/local/bin/perl
#
#  dump_vars : dump all the (sorted by name) Environmental Variables
#              formatting them nicely in HTML
#######################################################
print "Content-type: text/html\n\n";
print "<ul>";  # an unordered bullet list
foreach (sort keys %ENV)  {
     print "<li> Env Var key: $_ value $ENV{$_}";
}
print "</ul>"; # end the bullet list
exit 0;

Think about Figure 20.4 for a moment. Why are only Web-related environmental variables showing up, when I specified that the entire %ENV array be listed? The answer goes back to the fundamentals; the environmental variables shown belong not to a specific user, but to the CGI process owner (typically nobody). The process owner had no environmental variables set up before the script started; hence, the minimal list you see in this figure.
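The same effect can be reproduced on the command line with env -i, which starts a command with an empty environment plus whatever variables are listed explicitly. The sketch below is an approximation of what the server does, not a description of any server's internals; the function name run_like_server, the variable USER_SETTING, and the sample values are all assumptions.

```sh
#!/bin/sh
# run_like_server COMMAND...: run COMMAND with none of the caller's environment,
# passing along only a couple of sample CGI variables.
run_like_server() {
    env -i GATEWAY_INTERFACE=CGI/1.1 SERVER_NAME=www.some.box "$@"
}

# A variable exported in the developer's login shell...
USER_SETTING=from-login-shell
export USER_SETTING

# ...is invisible to the child, while the variables passed explicitly are present.
run_like_server /bin/sh -c 'echo "USER_SETTING=${USER_SETTING:-unset} SERVER_NAME=$SERVER_NAME"'
```

This is also a useful way to catch scripts that silently depend on the developer's own PATH or other personal settings.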

A gateway program must begin its output with a proper header that the server will understand. The server recognizes three headers (at this time):

Content-type: [type]/[subtype]  This was discussed at the beginning of this chapter. For the most part, the developer will be using the following in Perl:

print "Content-type: text/html\n\n";

Location: [URL]  This causes the server to ignore any trailing data and perform a redirect-that is, it tells the client to retrieve the data specified by the URL as if the client originally had requested that URL. The code

print "Location: http://www.some.box.com/the_other_file.html\n\n";

for example, causes the server to tell the client to retrieve the_other_file.html. Here is a brief Perl script that takes advantage of the Location header:

#!/usr/local/bin/perl
$filename = `ls -t /web/updates/ | head -1`;  # backticks run the command
chomp($filename);                             # drop the trailing newline
print "Location: http://www.some.box/$filename\n\n";
exit;

In this sample, the value of the $filename variable is the most recently modified file in the specified directory. Using the Location header directs the client to retrieve that file, even though the client has no prior knowledge of which file that is.
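
The same lookup can be done without starting a shell. The following is a minimal Perl 5 sketch, not part of the original script: the helper name newest_file is hypothetical, and the directory and host in the usage comment are the placeholders used above.

```perl
use strict;
use warnings;

# newest_file is a hypothetical helper: it returns the name of the most
# recently modified plain file in $dir, or nothing if there is none.
sub newest_file {
    my ($dir) = @_;
    opendir(my $dh, $dir) or return;
    # Keep plain files only, skipping dot files and subdirectories.
    my @files = grep { !/^\./ && -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return unless @files;
    # -M is the file's age in days; the smallest age is the newest file.
    my ($newest) = sort { -M "$dir/$a" <=> -M "$dir/$b" } @files;
    return $newest;
}

# Possible use in a gateway program (placeholder directory and host):
#   my $file = newest_file('/web/updates');
#   print "Location: http://www.some.box/$file\n\n" if defined $file;
```

Reading the directory inside Perl avoids the cost of two extra processes and removes any dependence on how ls formats its output.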

Status: [status code] [message string]  This causes the server to replace the default status code and message text that it normally would return to the client:

print "Status: 301 Moved Permanently\n";

Note that only a certain range of numbers is valid here: 200-599. Anything else causes an error.

Note
No-parse header scripts are gateway programs whose file names historically began with nph-; newer servers have dropped that requirement. The server does not parse or create its own headers; it passes the gateway output directly to the client untouched. The gateway output must begin with a valid HTTP response:
print "HTTP/1.0 200 OK\n";
print "Content-type: text/html\n\n";
One reason why a developer might want to use nph- scripts is that, because the server doesn't parse the output, the client receives a response quicker. Of course, other factors could affect the response time. Another reason to use nph- scripts is if you want to display a series of images or text strings serially to the client, each one overlaying the previous item (a poor man's animation); this is the last code example presented in this chapter.

Manipulating the Client Data with the Bourne Shell

The Bourne Shell is great for doing UNIX-specific activities, but is a weak tool for web development because it lacks the text-manipulation facilities of Perl. As an example, Listing 20.6 presents a simple Bourne Shell script, called by a METHOD=POST form, that separates the fields into shell variables.


Listing 20.6. A Bourne Shell script to handle METHOD=POST.
#!/bin/sh
echo "Content-type: text/html"
echo

echo "<HEAD><TITLE>Display Form Variables</TITLE></HEAD>"
read buffer
echo $buffer > /tmp/awk.temp.$$

awk ' {elements =  split($0, fields, "&") }
      {print "number of elements = " elements}
      {print "<P>"}

      { for (elements in fields)
            { junk = split(fields[elements], value, "=")
              printf "value of record =  %s", value[2]
              print "<BR>" }
      } '  /tmp/awk.temp.$$

rm /tmp/awk.temp.$$
echo "<BR>"

The output from this script still will be encoded. UNIX programs such as sed or tr can be used to decode the data, and the GNU version of awk (gawk) does have a substitution function. Things are getting a bit unwieldy at this point, however, and the developer does not need to reinvent the wheel. There are easier ways to accomplish these tasks: with Perl.

Note
If you're unable or unwilling to use Perl, a package that allows you to access and decode form variables and still use shells such as Bourne does exist. The Un-CGI package decodes form variables and places them in the shell's environment variables. (See note)

Manipulating the Client Data with Perl

As you can see from Listing 20.6, there's a bit of work to be done before the developer can get to the client's data and accomplish real tasks.

Fortunately, Larry Wall created the Practical Extraction and Report Language (Perl). Perl looks like C but subsumes a lot of features originally found in utilities such as sed, awk, and tr. Although it doesn't allow you to get as close to the system as C, it is an excellent choice to quickly develop complex CGI programs. Perl's strength is precisely what most CGI programs need-powerful and flexible text-manipulation facilities. For these reasons, Perl has become a popular software choice for CGI programming. (See note)

To decode a variable in Perl, for example, you can use code such as the following (from cgi-handlers.pl):

tr/+/ /;
s/%(..)/pack("c",hex($1))/ge;

These two simple lines decode all the encoded characters in a string variable in one step.
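
To show those two lines in context, here is a minimal Perl 5 sketch that wraps them in a subroutine and applies them to a whole query string. The names url_decode and parse_query are hypothetical, chosen for illustration; they are not taken from cgi-handlers.pl.

```perl
use strict;
use warnings;

# url_decode (hypothetical name) reverses the encoding applied to form data.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                                    # a plus sign encodes a space
    $s =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;  # %XX encodes one byte
    return $s;
}

# parse_query (hypothetical name) splits name=value pairs on & and decodes
# each side. Repeated names are not handled here; the last value wins.
sub parse_query {
    my ($query) = @_;
    my %form;
    foreach my $pair (split(/&/, $query)) {
        my ($name, $value) = split(/=/, $pair, 2);
        $value = '' unless defined $value;
        $form{ url_decode($name) } = url_decode($value);
    }
    return %form;
}

my %form = parse_query('name=Mark+G&note=50%25+off%21');
print "$form{name}: $form{note}\n";   # prints "Mark G: 50% off!"
```

The order of the two steps matters: plus signs are translated first, so that an encoded plus sign (%2B) survives as a literal + after decoding.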

This completes the discussion of the CGI fundamentals. Now I'll move on to real-life code that illustrates how the simpler pieces fit together to form useful applications.

To Imagemap or Not to Imagemap

Users without access to full graphical Web interfaces often use line browsers such as Lynx or W3. Imagemaps do not appear on line browser terminals; the word [IMAGE] appears in Lynx, but it is not clickable. Therefore, it is important to cater to the Lynx users of the world when developing an imagemap front end. How do you distinguish Lynx and its peers from the Mosaics and Netscapes of the world? This will become clear when I show you bimodal.pl, which uses a little environmental variable trick.

Code Walkthrough: bimodal.pl

The program bimodal.pl is so named because it offers two modes: an imagemap and a standard textual link interface. It queries the environmental variable $ENV{HTTP_USER_AGENT} and switches to the mode appropriate to the user's browser. If a line browser such as Lynx is detected, it would be inappropriate to display an imagemap: the image would display as [IMAGE] with no clickable region, so the imagemap would be functionally useless to a Lynx user. The program sidesteps this difficulty and reverts to text links in such cases. For graphical browsers such as Mosaic or Netscape, the imagemap is displayed. Listing 20.7 shows the bimodal.pl code.


Listing 20.7. The Perl script bimodal.pl queries the HTTP_USER_AGENT environmental variable.
#!/usr/local/bin/perl
#
#  bimodal.pl
#
#  First things first, supply the MIME header

print "Content-type: text/html\n\n";

#  If line-browser detected, print the textual HTML.  Else,
#  user has a GUI browser and I use the imagemap.

if ( $ENV{HTTP_USER_AGENT} =~ /Lynx|LineMode|W3/i ) {
#
print <<EndOfGraphic;

<TITLE> What's for Dinner? - Text version</TITLE>
<H1>What's for Dinner? - Text version</H1>

<A HREF=http://www.some.box/enchilada.html>Enchilada</A> |
<A HREF=http://www.some.box/hamburger.html>Hamburger</A> |
<A HREF=http://www.some.box/kabob.html>Shish Kabob</A> |
<A HREF=http://www.some.box/hotdog.html>Hot Dog</A> |
<A HREF=http://www.some.box/spag.html>Spaghetti</A>
<BR><HR>

EndOfGraphic
#  The label EndOfGraphic is reached.  Now the "else" part of the if
#   statement takes over - to present GUI browsers with an imagemap.
}

else {
print <<EndOfImap;

<title>What's for Dinner? - Graphic version</title>
<H1>What's for Dinner? - Graphic version</H1>
<A HREF="http://www.some.box/cgi-bin/imagemap/dinner.map">
    <img src="http://www.some.box/icons/dinner.gif" ismap>
</a>
<HR>
<A HREF=http://www.some.box/sams/>Index of WDG Web Pages</A>
EndOfImap
}
exit;

Tip
The bimodal.pl script uses a trick common to the original Bourne Shell and Perl that can be very handy when a developer needs to output lots of HTML. The following code prints the HTML block exactly as is until it encounters the terminating string SomeLabel:
print <<SomeLabel;
<HTML-block-line-1>
<HTML-block-line-2>
<HTML-block-line-3>
<HTML-block-line-4>
...
<HTML-block-last-line>
SomeLabel
This technique is very handy because it produces very readable code with a minimum of fuss. The alternative-outputting HTML with multiple Perl print <some-HTML> statements-can cause headaches because special characters within the <some-HTML> string must be escaped in order to print properly, or, more fundamentally, in order for the Perl program to run without syntax errors. As a simple example, if I want to output the following HTML in a Perl CGI program,
<A HREF="http://is-2.stern.nyu.edu/">The InfoSys Home Page</A>
I can use a Perl print statement and escape the interior quotation marks by using this code:
print "<A HREF=\"http://is-2.stern.nyu.edu/\">The InfoSys Home Page </A>";
Or, I can say
print <<EndHTML;
<A HREF="http://is-2.stern.nyu.edu/">The InfoSys Home Page</A>
EndHTML

Caution
In Perl 5, there is a hidden danger using this technique:
print <<SomeLabel;
HTML-BLOCK
SomeLabel
An unescaped @ character inside the HTML-BLOCK is treated as the start of an array name and can stop the program from compiling; write it as \@ instead. In any Perl version, another trap must be avoided in this construction: The terminating string, SomeLabel, must appear flush left without any leading white space. If it is indented, Perl cannot find the end of the block and reports a missing string terminator.
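
The two ways around the @ problem can be sketched as follows. This is an illustrative Perl 5 fragment; the variable $user and the mail address are made up for the example.

```perl
use strict;
use warnings;

my $user = 'webmaster';   # hypothetical value for illustration

# Interpolating here-document: escape the @ so Perl does not look for
# an array named @some. Scalars such as $user are still interpolated.
my $interpolated = <<EndHTML;
<A HREF="mailto:$user\@some.box">Mail $user</A>
EndHTML

# Single-quoted label: nothing is interpolated, so the @ needs no escape,
# but a variable such as $user would be printed literally as well.
my $literal = <<'EndHTML';
<A HREF="mailto:webmaster@some.box">Mail webmaster</A>
EndHTML

print $interpolated;
print $literal;
```

Both print statements emit the same line of HTML. The single-quoted form is the safer choice for blocks of fixed HTML; the escaped form is needed when the block also contains variables.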

Figure 20.5 shows the result of bimodal.pl executing from a GUI Web browser-Mosaic 2.5 for X.

Figure 20.5 : Because a GUI Web browser is used, bimodal.pl displays an imagemap front end.

Figure 20.6 shows the result of bimodal.pl executing from a line Web browser-the University of Kansas's Lynx. (See note) The script bimodal.pl avoids showing the imagemap, which would have no meaning to a Lynx user, and reverts to a standard textual hyperlink front end that has the same functionality.

Figure 20.6 : A line browser's view of the Web site shown in Figure 20.5.

Now look at another useful example. Suppose that I want to fetch one or more documents from a Web server, but only if the modification date and time (the timestamp) has changed from the last time I checked it.

Here's how I can do it: I can use Perl to set up a TCP client socket connection between my machine (the client) and the Web server and send the server a HEAD method to get metainformation about the files (specifically, their timestamps) sent in the socket back to my machine. I then consult a database of the file timestamps and compare my database to the newly received information. If they match, the file in question was unchanged and I take no action. If they don't match, I fetch the contents of the file to my local machine.

The code in Listing 20.8, get_head,(See note) follows this scheme. The code to set up a socket connection is fairly dense but, thankfully, it's all in the important book Programming Perl, by Schwartz and Wall, published by O'Reilly & Associates, 1991. System V-style UNIX, such as Solaris 2.X and SGI, will need the file socket.ph to run this code. Also note the use of the Perl dbmopen function to keep a database of file timestamps.


Listing 20.8. The get_head program to demonstrate sockets and the HEAD method.
#!/opt/bin/perl
#
#  get_head : uses HEAD method to test timestamp modification on a group
#             of remote files,
#             and saves locally those files that were modified since
#             the last time we checked.
#
#  First, define some useful HTTP Protocol Status Codes and Messages
#  in two associative arrays.
#
%OkStatusMsgs = (
  200, "OK 200",
  201, "CREATED 201",
  202, "Accepted 202",
  203, "Partial Information 203",
  204, "No Response 204",
);
%FailStatusMsgs = (
  -1,  "Could not lookup server",
  -2,  "Could not open socket",
  -3,  "Could not bind socket",
  -4,  "Could not connect",
  301, "Found, but moved",
  302, "Found, but data resides under different URL (add a /)",
  303, "Method",
  304, "Not Modified",
  400, "Bad request",
  401, "Unauthorized",
  402, "PaymentRequired",
  403, "Forbidden",
  404, "Not found",
  500, "Internal Error",
  501, "Not implemented",
  502, "Service temporarily overloaded",
  503, "Gateway timeout ",
  600, "Bad request",
  601, "Not implemented",
  602, "Connection failed (host not found?)",
  603, "Timed out",
);

$outfile = "/home/mginsbur/filecontents.txt";  # we'll append all changed
                                               # files to this local file.

open(OUTFILE,">>$outfile") || die "cannot open $outfile \n";

$baseurlpath = "/usr/local/aries/web/testsock";
$server = "http://edgar.stern.nyu.edu/testsock";
chdir($baseurlpath) || die "cannot chdir to $baseurlpath \n";

foreach $f (<*.html>) {

   print "Processing file $server/$f \n";
   dbmopen (%time_stamps,"timedb",0666);  # open the database of timestamps
   $status = &Check_URL ("$server/$f");
   print "Status: $status\n";
   dbmclose (%time_stamps);

}

exit 0;
###################
#   Subroutines   #
###################

sub Check_URL {

local($URL) = @_;

if ($URL !~ m#^http://.*#i) {
  print "wrong format http!\n";
  return;
}
else {     # Get the host and port

  if ($URL =~ m#^http://([\w-\.]+):?(\d*)($|/(.*))#) {
    $host = $1;
    $port = $2;
    $path = $3;
  }
  if ($path eq "") {
     $path = '/'; }  # give a "/" if none supplied in the path
  if ($port eq "") {
     $port = 80; }   # port 80 is standard

  $path =~ s/#.*//;    # Delete name anchor

}

#####################################################################
# The following is largely taken from the 'Programming Perl' book,  #
# Schwartz and Wall, on a sample Perl TCP/IP Client:  pages 342-344.#
#####################################################################

$AF_INET = 2;
$SOCK_STREAM = 1;

$sockaddr = 'S n a4 x8';

chop($hostname = `hostname`);   # backticks run the hostname command

($name,$aliases,$proto) = getprotobyname('tcp');
($name,$aliases,$port) = getservbyname($port,'tcp') unless $port =~ /^\d+$/;
($name,$aliases,$type,$len,$thisaddr) = gethostbyname($hostname);
if (!(($name,$aliases,$type,$len,$thataddr) = gethostbyname($host))) {
  return -1;
}

$this = pack($sockaddr, $AF_INET, 0, $thisaddr);
$that = pack($sockaddr, $AF_INET, $port, $thataddr);

# Make the socket filehandle.

if (!(socket(S, $AF_INET, $SOCK_STREAM, $proto))) {
  $SOCK_STREAM = 2;
  if (!(socket(S, $AF_INET, $SOCK_STREAM, $proto))) { return -2; }
}

if (!(bind(S, $this))) {      # bind locally
  return -3;
}

if (!(connect(S,$that))) {    # connect remotely
  return -4;
}

select(S);
$| = 1;           #  unbuffer the i/o because we have 2 filehandles
select(STDOUT);

print S "HEAD $path HTTP/1.0\n\n";  # send the web server a HEAD request
#print S "GET $path HTTP/1.0\n";    # could have used a CONDITIONAL GET
#print S "If-Modified-Since: Monday, 03-Jun-96 14:57:50 GMT\n\n";
#
$response = <S>;

($protocol, $status) = split(/ /, $response);

print "Response from HEAD request is: $response \n";
#
#  check the Response.  If it's OK, get the modification time and
#  compare that to the entry in our timestamp database.  If they
#  match, set the return value to 1.  Otherwise, set the return value
#  to 0 and use a GET to get the contents and write to a file.
#
for ($i = 0 ; $i < 100; $i++) {  # give the response a chance to form
    $response = <S>;
    print "$response";       # display it on STDOUT
    if ($response =~ /Last-Modified/i) {  # expect Last-Modified
           ($junk, $time) = split (/: /,$response);
           if (!(($time_stamps{$path})) || ($time_stamps{$path} ne $time)) {
               $time_stamps{$path} = $time;
               close (S);
               &write_file_to_disk;  # if file changed, save it to disk
               return 0;   # 0 means the file has been changed since
                           # the last time we built a timestamp entry for it.
           }
    }
}

close(S);  # close the Socket

return 1;   # 1 means the file has not been changed.

}
#
# If the database timestamp does not match the actual file modification
# timestamp, write its contents to local disk using the c-program
# http_get.  (see Listing 20.17 for the source of http_get).
#
sub write_file_to_disk {

   print "Capturing File ... \n";
   $contents = `/home/mginsbur/bin/http_get $server/$f`;
   print "Captured: $server/$f successfully ... \n";
   print "Appending $server/$f to file $outfile ... \n";
   print OUTFILE "$contents";
   print "$server/$f has been written to file $outfile. \n\n";
}

This code is best illustrated with an example. Suppose that I have a directory on a Web server corresponding to the URL http://edgar.stern.nyu.edu/testsock.

Here is a listing of the files in that directory:

-rw-r--r--   1 mginsbur staff      50284 Jun 27 17:21 analog.html
-rw-rw-r--   1 mginsbur staff       1286 Jun 27 17:29 hydrant.html

Let's say that I run the program for the first time from the directory ~/test. Because it is the first time, no timestamp database has been built yet and the files are all new. Therefore, I capture both of them to a local file, as shown in Listing 20.9.


Listing 20.9. get_head: First program execution.
Processing file http://edgar.stern.nyu.edu/testsock/analog.html
Response from HEAD request is: HTTP/1.0 200 Document follows

Date: Fri, 28 Jun 1996 20:30:27 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:21:38 GMT
Capturing File . . .
Captured: http://edgar.stern.nyu.edu/testsock/analog.html successfully . . .
Appending http://edgar.stern.nyu.edu/testsock/analog.html to file /home/mginsbur/filecontents.txt . . .
http://edgar.stern.nyu.edu/testsock/analog.html has been written to file /home/mginsbur/filecontents.txt.

Status: 0
Processing file http://edgar.stern.nyu.edu/testsock/hydrant.html
Response from HEAD request is: HTTP/1.0 200 Document follows

Date: Fri, 28 Jun 1996 20:30:28 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:29:51 GMT
Capturing File . . .
Captured: http://edgar.stern.nyu.edu/testsock/hydrant.html successfully . . .
Appending http://edgar.stern.nyu.edu/testsock/hydrant.html to file /home/mginsbur/filecontents.txt . . .
http://edgar.stern.nyu.edu/testsock/hydrant.html has been written to file /home/mginsbur/filecontents.txt.
Status: 0

Now, I run the program a second time without altering any of the files on the Web server. Study Listing 20.9 and see whether you can follow what action the program will take. Listing 20.10 shows the output.


Listing 20.10. get_head: Second program execution.
Processing file http://louvain.ny.jpmorgan.com/testsock/analog.html
Response from HEAD request is: HTTP/1.0 200 Document follows

Date: Fri, 28 Jun 1996 20:31:05 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:21:38 GMT
Content-length: 50284

Status: 1
Processing file http://louvain.ny.jpmorgan.com/testsock/hydrant.html
Response from HEAD request is: HTTP/1.0 200 Document follows

Date: Fri, 28 Jun 1996 20:31:05 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:29:51 GMT
Content-length: 1286
Status: 1

Sure enough, because neither file was modified since the last time I collected their timestamp information, the program takes no action and returns a status code of 1 for each file.

It's time to complete the picture by changing one of the files' timestamps. I can do this easily with the UNIX touch command:

touch /usr/local/edgar/web/testsock/hydrant.html

Now hydrant.html has been updated; analog.html's timestamp still matches the original information collected in the program's first run. The directory listing of /usr/local/edgar/web/testsock has been correspondingly updated and two new files are present:

-rw-r--r--   1 mginsbur staff      50284 Jun 27 17:21 analog.html
-rw-rw-r--   1 mginsbur staff       1286 Jun 28 16:31 hydrant.html
-rw-rw-r--   1 mginsbur staff          0 Jun 28 16:30 timedb.dir
-rw-rw-r--   1 mginsbur staff       1024 Jun 28 16:30 timedb.pag

The two timedb.* files compose the timestamp database that the program get_head creates the first time it is run and updates every subsequent time it is run.

I run the program for a third time and the output in Listing 20.11 appears.


Listing 20.11. get_head: Third program execution.
Processing file http://louvain.ny.jpmorgan.com/testsock/analog.html
Response from HEAD request is: HTTP/1.0 200 Document follows

Date: Fri, 28 Jun 1996 20:32:12 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:21:38 GMT
Content-length: 50284

Status: 1
Processing file http://louvain.ny.jpmorgan.com/testsock/hydrant.html
Response from HEAD request is: HTTP/1.0 200 Document follows

Date: Fri, 28 Jun 1996 20:32:12 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Fri, 28 Jun 1996 20:31:43 GMT
Capturing File . . .
Captured: http://louvain.ny.jpmorgan.com/testsock/hydrant.html successfully . . .
Appending http://louvain.ny.jpmorgan.com/testsock/hydrant.html to file /home/mginsbur/filecontents.txt . . .
http://louvain.ny.jpmorgan.com/testsock/hydrant.html has been written to file /home/mginsbur/filecontents.txt.
Status: 0

As expected, the file analog.html is checked against the timestamp database and, because it has not been modified, no action is taken. The other file, hydrant.html, was updated and new contents are fetched to the local file system. One obvious use for the techniques presented in get_head.pl is in the case of a Web-crawling search program; this process traverses the Web looking for new content to index. If it can effectively check the timestamp of files it encounters, it does not need to download every single file it finds to index. It can incrementally index and save a lot of network and CPU time.

Note
You might have noticed in the get_head.pl code listing a commented-out section near the HEAD method. This is a CONDITIONAL GET method, which is very similar logically: the request carries an If-Modified-Since header that names a date. If the document has not been modified since that date, a status code of 304 is returned and no content is sent. If it was modified, the server sends the contents immediately in the same response. The HEAD method, by contrast, generates a 200 (OK) message if all is well, and it's up to the script to do logical comparisons on the file metainformation received from the server.

The concepts presented in the Perl socket application are very powerful and well worth study. As the Web grows, so does the noise-to-signal ratio, and filtering mechanisms become essential. The idea of selectively fetching only new documents is appealing to newsfeed applications, text-index searching, and generalized agent technology. A user can launch an application, for example, to fetch only new documents from a favorite Web site. Such an agent quite easily could automatically update the user's browser bookmarks file.
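
The comparison at the heart of get_head can be isolated from the socket code. The following is a minimal Perl 5 sketch under stated assumptions: the names last_modified and has_changed are hypothetical, an ordinary hash stands in for the dbmopen database, and the return values are the reverse of get_head's (here 1 means changed).

```perl
use strict;
use warnings;

# last_modified (hypothetical name) pulls the Last-Modified value out of
# raw response headers; it returns nothing when the header is absent.
sub last_modified {
    my ($headers) = @_;
    return $1 if $headers =~ /^Last-Modified:[ \t]*(.*?)\r?$/im;
    return;
}

# has_changed (hypothetical name) compares the header against the stored
# timestamp for $path and records the new value. Note that 1 means
# "changed" here, the opposite of the convention used by get_head.
sub has_changed {
    my ($seen, $path, $headers) = @_;
    my $stamp = last_modified($headers);
    return 0 unless defined $stamp;          # no header: treat as unchanged
    return 0 if defined $seen->{$path} && $seen->{$path} eq $stamp;
    $seen->{$path} = $stamp;
    return 1;
}

my %seen;   # stands in for the timestamp database
my $head = "HTTP/1.0 200 Document follows\r\n"
         . "Server: NCSA/1.5\r\n"
         . "Last-modified: Thu, 27 Jun 1996 21:21:38 GMT\r\n\r\n";
print has_changed(\%seen, '/testsock/analog.html', $head), "\n";  # 1: first sighting
print has_changed(\%seen, '/testsock/analog.html', $head), "\n";  # 0: same timestamp
```

Keeping the comparison separate from the network code makes it easy to exercise with canned headers before pointing the program at a live server.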

An Integrated E-Mail Gateway Application

One of the advantages of Perl from the developer's perspective is that a small building-block program easily can be customized and integrated into a bigger application.

Consider the following real-life design problem stemming from a telecommunications class final project at the NYU Stern School of Business. A group of students wanted to write a set of Perl CGI programs to provide on-line corporate recruiting, as shown in these steps: (See note)

  1. As a necessary preliminary step, the students create HTML resumés and place them in a common directory.
  2. The first CGI program, resume_builder.pl, is launched by the resumé system administrator and automatically creates a table of contents linking to each resumé. The program is smart enough to avoid creating links to files that are not student resumés.
  3. The output of resume_builder.pl is resume.toc.html, which provides the corporate recruiter with an Action button. If the recruiter clicks this button, a picklist of all the resumés appears (built at request time by resume_form.pl) and the recruiter can click one or more names to choose who receives a broadcast e-mail message.
  4. The third CGI program, resume_mail.pl, is the e-mail gateway back end to resume_form.pl. This program is the glue between the picklist and the actual UNIX mail program.

The system is making the implicit assumption that between steps 2 and 3, the recruiter has scanned the resumés and located the most promising ones.

I think it will be instructive to see the code that went into resume_builder.pl, resume_form.pl, and resume_mail.pl. Listings 20.12 through 20.14 provide this code.


Listing 20.12. resume_builder.pl.
#!/usr/local/bin/perl
#
# resume_builder.pl
#
# Resume Project
#
# this program will read in the directory and output
# HTML links to each valid resume (studentname.html is valid).
#
$site =  "www.stern.nyu.edu";   # no trailing space: $site is used inside URLs
$basepath = "/usr/users/mark/book/src";
$output = "$basepath/resume.toc.html";
$link   = "$basepath/index.html";
$hits = $misses = 0;
$prefix = "<dd><A HREF=\"http://www.stern.nyu.edu/~lma/project/";
$suffix = "\">";

open(OP, ">$output")    || die "cannot open the OUTPUT file";

@my_array = `ls`;    # set an array to the unix output of `ls`
chop(@my_array);     # remove the trailing newline from each file name

&init;  # write the header HTML lines

#
#  Now loop through and pull out only the valid resumes which are of the form
#  (name).html
#  Avoid this program's output (resume.toc.html), any pictures (*.pic) files,
#  and the special index.html file which is a symbolic link to resume.toc.html
#

for ($i=0; $i<=$#my_array; $i++)  {

($name,$ext) = split(/\./,$my_array[$i]);    # split xxxx.html on the period
                                             # note the assumption that the file
                                             # name has no extra
                                             # embedded periods!

if  (($name =~ /resume/) || ($name =~ /index/) || ($name =~/pic/))  {
    $misses++;
    print "skipping $name.$ext \n"; }   # command-line info msg
else{
    $hits++;
    $combo = $prefix.$my_array[$i];
    print OP "$combo";
    $real_suffix = $suffix.$name."</a>";
    print "picking up $name resume \n";  # command-line info msg
    print OP "$real_suffix  </dd><br> \n";
}

}

print "\n $hits Hits and $misses Misses \n";  # closing info msg

&trlr;

close(OP)  || die "cannot close output";

#
#  build a symbolic link to index.html * if one does not yet exist *
#

if (-e $link)  {
}
else{
    `ln -s resume.toc.html index.html`;   # backticks run the command
     print "$link symbolic link built \n";
}

exit 0;

#
#  init - outputs the Title and header and introductory msg
#

sub init{

print OP "<TITLE>WWW Resume Collection</TITLE><br>";
print OP "<H1>WWW Resume Collection</H1><br>";

print OP "Welcome to the NYU resume database.  ";
print OP  "It will match recruiters to qualified candidates. ";
print OP  "Recruiters can screen through our resume database and contact ";
print OP  "selected candidates via email by filling out a form. <p>";
print OP "<HR><b> Click on a name to view a resume. </b><br>";
print OP "<br>";
}

#
# trlr - outputs the trailing info and credits
#

sub trlr{

print OP "<br><br>If you wish to contact any of the people in our
database, you have the option to send them an email message.  To do
so, click <A HREF =
\"http://$site/~lma/project/resume_form.pl\"><B>CONTACT
FORM</B></a><p>";
print OP "<Hr>Thank you for using our database.<br>
We hope that you have found it useful.<p>";

print OP "<b>Project Team</b>";
print OP "<a href= \"http://$site/~pcheng\">Peter Cheng</a>";
print OP "<a href= \"http://$site/~pliu\">Peggy Liu</a>";
print OP "<a href= \"http://$site/~lma\">Lisa Ma</a>";
print OP "<a href= \"http://$site/~hshayovi\">Heshy Shayovitz</a><p>";
print OP "<HR>";
}

Tip
The technique of defining an index.html symbolic link is very useful. If a user enters the resumé system and does not supply a file name, the server usually is configured to look for the file index.html (home.html is another popular choice). Thus, in Listing 20.12, I check to see whether index.html exists. If it does not yet exist, I build the symbolic link to the output of the program. This step is necessary only once, of course; hence the existence check.
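
Perl also has a built-in symlink function, so the existence check and the link can be written without calling ln. The following is a minimal Perl 5 sketch; the helper name ensure_index_link is hypothetical, and the file names in the usage comment are the ones from Listing 20.12.

```perl
use strict;
use warnings;

# ensure_index_link (hypothetical name) makes $link point at $target
# unless something already exists under the name $link.
# It returns 1 if it built the link and 0 if nothing had to be done.
sub ensure_index_link {
    my ($target, $link) = @_;
    return 0 if -e $link || -l $link;   # present already (or a dangling link)
    symlink($target, $link) or die "cannot link $link to $target: $!\n";
    return 1;
}

# Possible use at the end of resume_builder.pl:
#   ensure_index_link('resume.toc.html', 'index.html')
#       && print "index.html symbolic link built \n";
```

The extra -l test catches a link whose target has been removed; -e alone follows the link and would report that nothing exists, after which symlink would fail.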

The next program, resume_form.pl, builds the picklist of candidate resumés dynamically (see Listing 20.13). Its structure is quite similar to resume_builder.pl. Notice the high degree of modularity-the form is broken into rather small subroutines. The dynamic build of the picklist is separated into its own routine for easy readability and maintenance.


Listing 20.13. resume_form.pl.
#!/usr/local/bin/perl
#
#  resume_form.pl
#
print "Content-type: text/html\n\n";

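@my_array = `ls`;    # set an array to the unix output of `ls`
chop(@my_array);     # remove the trailing newline from each file name

$site= "www.stern.nyu.edu";   # no trailing space: $site is used inside URLs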
$prefix = "<A HREF=\"http://$site/~lma/project/";
$suffix = "\">";

&init;

&build_top_of_form;

&build_picklist;

&build_rest_of_form;

&trlr;

#

sub init{

print "<TITLE>WWW Resume Contact Form</TITLE><br>";
print "<H1>WWW Resume Contact Form</H1><br><HR>";
print "The following is a form which will allow you to send messages ";
print "to the resumes of the candidates that you have just viewed. ";
print "You have the option to send to multiple candidates from the ";
print "picklist by holding down the CONTROL or SHIFT keys and clicking on ";
print "the desired names.<hr>";
}

#
#  build_top_of_form - write common form header, up to the point
#  where the list of resumes must be generated.
#

sub build_top_of_form{

print "<FORM METHOD=\"POST\" ";
print  "ACTION=\"http://$site/~lma/project/resume_mail.cgi\">";
print "<b> Contact Name: </b>";
print "<br>";
print "<INPUT NAME=\"cname\"><br>";
print "<b>Company: </b>";
print "<br>";
print "<INPUT NAME=\"Company\"><br>";
print "<b>Address: </b>";
print "<br>";
print "<INPUT NAME=\"Address\"><br>";
print "<b>Telephone #: </b>";
print "<br>";
print "<INPUT NAME=\"Tel\"><br>";
print "<b>Fax #: </b>";
print "<br>";
print "<INPUT NAME=\"Fax\"><br>";

print "<b>What is the subject of this message?</b>";
print "<br>";
print "<INPUT NAME=\"Subj\"><p>";

print "<b>Send to: </b><br>";
print "<SELECT NAME=\"resume\" size=7 MULTIPLE>";

}
#
#  Note:  the C for loop is quite unnecessary in Perl.  I could say
#  for (@my_array) and accomplish the same thing.
#
sub build_picklist{
for ($i=0; $i<=$#my_array; $i++)  {

     ($name,$ext) = split(/\./,$my_array[$i]);    # split xxxx.html on pd.

     if (($name =~ /resume/) || ($name =~ /index/) || ($name =~ /pic/))  {
     }

     else{
          print "<OPTION>$name";

     }   # end the If statement

}   # end the for loop
print "</SELECT><p>";
}

sub build_rest_of_form{
print "<b>Please type your message here: </b><br>";
print "<TEXTAREA NAME=\"message\" ROWS=10 COLS=50></TEXTAREA><p>";
print "<INPUT TYPE=\"submit\" VALUE=\"Send Message\">";
print "<p>";
print "<INPUT TYPE=\"reset\" VALUE=\"Clear Form\">";
print "</form><p>";
print "<hr>";
}

sub trlr{
print "<a href=\"http://www.stern.nyu.edu/~lma/project\">";
print "<img src=\"http://edgar.stern.nyu.edu/icons/back.gif\">";
print "Return to the Resume System</A>";
print "<HR>";
}

Two scripts down, one to go. I'll complete the trilogy with resume_mail.pl, which is the program taking the output of resume_form.pl (that is, the recruiter's name, company, telephone, fax, e-mail message, and recipient(s) list) and piping it to the UNIX mail program. Listing 20.14 contains the code.


Listing 20.14. resume_mail.pl.
#!/usr/local/bin/perl
#
#  resume_mail.pl
#
#

$mailprog = '/usr/ucb/mail ';
$mailsuffix = '@stern.nyu.edu';
$comma = ',';
#
require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl';   # modified cgi-lib.pl

# Print a title and initial heading and the Right Header.

&html_header("Mail Form");  # modified because html_header takes an arg.

$i = 0;

# Get the input
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});

# Split the name-value pairs
@pairs = split(/&/, $buffer);
#
#   The next code is equivalent to using the &parse_request subroutine
#  that comes with the cgi-lib.pl Perl toolkit.  The goal is to get a
# series of name-value pairs from the form.
#
foreach $pair (@pairs)
{
   ($name, $value) = split(/=/, $pair);

   # decode the values passed by the form
   $value =~ tr/+/ /;
   $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

   # Stop people from using subshells to execute commands
   $value =~ s/~!/ ~!/g;

   #
   #  build an array r_array composed of all the names on the recipient list.
   #

      if ($name eq "resume") {
           $r_array[$i] = $value;
           $i++;               }

   $FORM{$name} = $value;  # assoc. array for rest of the form.
}  # end for - each

#
#  Now build $recip - the valid string of recipients, delimited by commas
#  e.g. csmith@stern.nyu.edu,bjones@stern.nyu.edu
#  This is done once, after the loop.  Trimming the final comma inside
#  the loop would apply substr to an empty string for every field that
#  precedes the first recipient, which Perl 5 can reject as a substr
#  outside of the string.
#
$recip = "";
for (@r_array) {
    $recip = $recip.$_.$mailsuffix.$comma;
}
chop($recip) if $recip ne "";  # get rid of comma at end.  Now $recip is fine.

# print "Final recipient List is $recip";   # uncomment this for debugging.

# Now send mail to $recip which is one or more students.
#
# Include form info plus info at end about the user's machine hostname and
# IP address.
#
open (MAIL, "|$mailprog -s \"$FORM{'Subj'}\" $recip ")
          || die "Can't open $mailprog!\n";

print MAIL "The contactname was $FORM{'cname'} from company $FORM{'Company'}\n";
print MAIL "has sent you the following message regarding your resume:\n\n";
print MAIL  "------------------------------------------------------------\n";
print MAIL "$FORM{'message'}";
print MAIL "\n------------------------------------------------------------\n";
print MAIL "Their fax: $FORM{'Fax'}\n";
print MAIL "Their tel: $FORM{'Tel'}\n";
print MAIL "Their addr: $FORM{'Address'}\n";
print MAIL "Their co: $FORM{'Company'}\n";
print MAIL "\n----S E N D E R  I N F O --------------------------------\n";
print MAIL "Recruiter at host: $ENV{'REMOTE_HOST'}\n";
print MAIL "Recruiter at IP address: $ENV{'REMOTE_ADDR'}\n";
close (MAIL);

&thanks;

exit 0;

#
#  Acknowledge mail
#
sub thanks{
print "<H2><TITLE>Mail Sent!</TITLE></H2><P>";
print "<B>Your mail has been sent.</B><br>";
print "<B>Thank you for using our resume database!</B><br>";
print "<hr>";
print "<a href=\"http://www.stern.nyu.edu/~lma/project\">";
print "<img src=\"http://edgar.stern.nyu.edu/icons/back.gif\">";
print "Return to the Resume System</A>";
print "<HR>";
}
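
Because resume_mail.pl interpolates form values into a shell command line, it is worth checking the recipient names before they reach the mail program. The following is a minimal sketch, not part of the original application: the subroutine name build_recipients and the rule for what counts as a legal user name are assumptions. It also uses join, which avoids the trailing comma that Listing 20.14 has to strip.

```perl
# Sketch only: build_recipients and the allowed-character rule are assumptions.
# Keeps only user names made of letters, digits, dot, underscore, and hyphen,
# then joins them with commas so no trailing comma has to be removed.
sub build_recipients {
    my($suffix, @users) = @_;
    my(@ok);
    foreach my $user (@users) {
        next unless $user =~ /^[A-Za-z0-9._-]+\z/;   # reject shell metacharacters
        push(@ok, $user . $suffix);
    }
    return join(",", @ok);
}
```

A call such as &build_recipients($mailsuffix, @r_array) would yield the recipient string. The same care applies to any other form value placed on the command line, such as $FORM{'Subj'}.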

Discussion of the Resumé Application

Starting from scratch, the entire application was built (by three novice programmers and one supervisor) in three days. This is a great advertisement for Perl and, more generally, the ease with which on-line applications can be built using CGI scripting. The system offers unlimited scope to grow (thousands of resumés conceivably could be stored in the base directory) and an excellent window by which corporate recruiters can interface with top students.

What's missing in the resumé-recruiter interface? Number one on my wish list is database functionality to permit search by keyword or other ad-hoc criteria; for example, "show me all students with programming skills in C and C++" or "show me all students who are graduating next term with foreign language proficiency in French or Spanish." This falls within the realm of database gateway programming and is discussed in Chapter 21.

Figure 20.7 shows the output of the resume_builder.pl program.

Figure 20.7 : The corporate recruiter travels to the URL http://www.stern.nyu.edu/~lma/project/ and sees a series of HTML links to student resumés, created by resume_builder.pl.

Figure 20.8 shows the screen corporate recruiters see after they submit the information in Figure 20.7. The recruiter now can send an e-mail message to one or more people in the picklist.

Figure 20.8 : The recruiter selects two lucky students and broadcasts an overture to them (who knows, perhaps a high-paying job?).

Extending the Transaction: Serial Transmission of Many Data Files in One Transaction

Often the developer is not content with sending one MIME header and one body of data to the client. Suppose that I want to send a series of images to the client in a logical loop. This is where a Netscape MIME extension called x-mixed-replace proves useful. X-mixed-replace supports data transfer to the client in this general manner (shown in Perl syntax):

A.   print "Content-type: multipart/x-mixed-replace; boundary=$sep\n";
B.   print "\n--$sep\n";
B.   print "Content-type: type/subtype\n";  # fill in type/subtype
B.   print "Content-length: $len\n\n";
B.   print $buf;  # where $buf is the data; make sure to measure its length!

The first line, labeled A, always is required. A boundary delimiter $sep must be defined, but it doesn't matter which string is assigned to $sep.

Then, the programmer picks the appropriate MIME type and subtype to display and repeats the lines in block B as often as required. As the last line's comment indicates, it is important to measure $buf exactly before sending it to the client. The x denotes that this is an experimental MIME type. Unfortunately, mixing different MIME types (for example, text and image or image and video) in one x-mixed-replace transmission is not supported. One Usenet reader complained that this was "pretty x-mixed-up," especially because standard MIME (SMTP) messages support the multipart/mixed format. So for the time being at least, you're confined to sending a single data format in this manner. Also, until the scheme gains further acceptance, the client must be a Netscape browser to handle the data stream.
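
The A and B pattern above can be wrapped in a few small subroutines. This is a minimal sketch: the names multipart_header, multipart_part, and multipart_end are hypothetical (they are not part of cgi-lib.pl), and using length to measure the data assumes that $buf holds the raw bytes to be sent.

```perl
# Sketch only: these helper names are hypothetical, not from cgi-lib.pl.

# Line A: sent once, before any part.
sub multipart_header {
    my($sep) = @_;
    return "Content-type: multipart/x-mixed-replace; boundary=$sep\n";
}

# Block B: a boundary line, the part's own headers, a blank line, the data.
sub multipart_part {
    my($sep, $type, $buf) = @_;
    my($len) = length($buf);          # measure the data exactly
    return "\n--$sep\n" .
           "Content-type: $type\n" .
           "Content-length: $len\n\n" .
           $buf;
}

# Closing boundary: tells the client that no more parts follow.
sub multipart_end {
    my($sep) = @_;
    return "\n--$sep--\n";
}
```

A script would print the result of &multipart_header($sep) once, then the result of &multipart_part($sep, $type, $buf) for each item, and finally &multipart_end($sep) if the series ever ends.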

The program presented in Listing 20.15, nph-image, shows the x-mixed-replace notion in action. This script displays a series of photographs serially, in a logical loop, to the client. It is an NPH-script that does not use operating system buffering.


Listing 20.15. nph-image.
#!/usr/local/bin/perl
#
#  nph-image:  a no-parse-header script to display a series of jpeg photos
#             serially; it is referenced within an IMG SRC tag in x.html.
#             Its output is understood by Netscape clients.
##############################################################################
require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl';
$photo_dir = "/is-too/tisakowi/web/isweb/testsite/photos/*.jpg";
$type = "image/jpeg";  # or image/gif depending on the application
@photos = `ls $photo_dir`;  # assemble the photo array

$SIG{"ALRM"} = "exit";  # in case user hits STOP during the transmission
alarm 10*60;            # timing delay
#
#  set the delay between pictures from the Query String, otherwise
#  set it to 1 second.
#
if ($ENV{'QUERY_STRING'} =~ /^\d+$/) {
   $delay = $ENV{'QUERY_STRING'}; }  # take the delay from the Query String
else {
   $delay = 1; }                     # missing or non-numeric: use 1 second

$sep = "=-+=-+=-+=MULTI__PART__SEPARATOR-+=-+=-+=";  # this is arbitrary

$| = 1;  # unbuffered i/o is important in nph-scripts

print "HTTP/1.0 200 OK\n";  # an nph-script must send the status line itself
print "Content-type: multipart/x-mixed-replace; boundary=$sep\n"; # req'd.

$first = 1;

do {

foreach $f (@photos) {

if ($first) {
     $first = 0; }       # no delay before the very first picture
else {
     sleep($delay); }

&output($f); }

}
while (1);
print "\n--$sep--\n";  # this will never occur
                       # unless infinite loop broken.

#
#  subroutine output:  print out *exactly* the buffer needed
#   for each picture.  Measure the picture's length (normally
#   with a stat function, except with jpegs needed to do it
#    a clumsier way).
#
sub output {
local($file) = @_;

local($len);
print "\n--$sep\n";
open(FILE, $file) || die "Error finding file $file";
print "Content-type: image/jpeg\n";
#$len = (stat($file))[7];  # does not seem to work on jpegs
$line = `ls -al $file`;
@stuff = split(/\s+/,$line);
$name = $stuff[2];  # primitive nametag
$len = $stuff[3];  # perl - got the length

print "Content-length: $len\n\n";

read(FILE, $buf, $len);
close(FILE) || die "cannot close file $file";
print $buf;
for ($a=0; $a<20; ++$a) {print "\n";}
}

Code Discussion: nph-image

This program sets up an infinite loop to show all the *.jpg photographs in a given directory, using the general x-mixed-replace scheme illustrated earlier.

The Perl statement

$|=1;

unbuffers the I/O; that is, output is flushed after every print statement, and the default output buffering is not used. This is done to avoid images building up in a buffer and then being released all at once, confusingly, to the client desktop. Another interesting feature is the use of the statement

$SIG{"ALRM"} = "exit";

Together with the alarm call on the line that follows it in the listing, this sets up a handler for the alarm signal, which arrives ten minutes after the script starts. The intent is a safety net for the case where the user clicks the Stop button in the Netscape browser. Without it, it might be very difficult to stop the constant stream of rotating images, and the user might even have to take the drastic step of killing the browser. Hopefully, with this trap, the script will halt in a reasonable amount of time.

The other important requirement is that I must measure each image, in bytes, before writing it to stdout for the client. Otherwise, images can overlay sections of the preceding ones incompletely and haphazardly. As you see in Listing 20.15, the line

#$len = (stat($file))[7]; # does not seem to work on jpegs

is commented out. This is the simplest way to get a length, which I use on *.gif images, but the function did not work for me on *.jpg files. I had to use a workaround as shown in the code. At any rate, after the length is known, exactly that amount is read into an input buffer and then is written out. The result is a smooth series of images (thanks to the unbuffered I/O and the care taken to measure image lengths). Once the program starts, it pushes image data at the client continuously, and this is a significant network load. A more sensible approach is to end the image rotation after a certain time interval or maximum number of images.
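
The two ideas in the preceding paragraph, reading exactly the measured number of bytes and capping the number of images sent, can be sketched as follows. The names read_exact and playlist are hypothetical. The -s file test is an assumption on my part: it reports the same size that stat returns, so on a system where stat misbehaves it should be verified locally before it is trusted.

```perl
# Sketch only: read_exact and playlist are hypothetical names.

# Return (length, data) for a file, or an empty list on failure.
sub read_exact {
    my($file) = @_;
    my($len, $buf, $got);
    $len = -s $file;                  # size in bytes via the -s file test
    return () unless $len && open(IMG, "<", $file);
    binmode(IMG);                     # no newline translation on any platform
    $got = read(IMG, $buf, $len);
    close(IMG);
    return () unless defined($got) && $got == $len;
    return ($len, $buf);
}

# Build a bounded list of files: cycle through @photos and stop
# after $max_images entries instead of looping forever.
sub playlist {
    my($max_images, @photos) = @_;
    my(@list, $i);
    return () unless @photos;
    for ($i = 0; $i < $max_images; $i++) {
        push(@list, $photos[$i % @photos]);
    }
    return @list;
}
```

Looping over the list that playlist returns, rather than using do { ... } while (1), lets the script send its closing boundary and exit on its own.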

Figure 20.9 shows the URL http://edgar.stern.nyu.edu/mgtest/x.html shortly after it is loaded into the Netscape client.

Figure 20.9 : An infinite series of repeating images is presented, with each image replacing the preceding one.

To complete the discussion of this animation application, Listing 20.16 shows the first few lines of the file x.html; note how the nph-image is embedded inconspicuously in the HTML img src tag near the top.


Listing 20.16. x.html.
<HTML><BODY bgcolor="#000070" text="#30ebe0" link="#d0d000" vlink="#ffffff">
<center>
<img height=125 width=125 src="http://edgar.stern.nyu.edu/mgbin/nph-image?1">
</center>
<TITLE>The Department of Information Systems Homepage</TITLE>
<H2>
The Department of Information Systems</H2>
(etc.)

Code Debugging

Debugging is a normal part of the developer's life. The first line of defense is syntax checking. For example, in Perl, I can type

perl -c <progname>

to check the Perl code for syntactic correctness. If the Perl interpreter likes the code, but the HTTP server doesn't, there is more work to be done. Fortunately, the CGI environment is flexible enough to give the developer several options for discovering the source of code problems.

When a CGI program crashes, the uninformative 500 Server Error message is displayed on the client screen. If the developer has access to the server's error log, that might provide a clue. A common error is not printing a proper header. The following script lacks the blank line after the Content-type statement:

print "Content-type: text/html\n";
print "<TITLE>A Bad Script</TITLE>\n";

This causes the following to show up in an NCSA server's error_log file:

[Tue May 16 20:19:04 1995] httpd: malformed header from script

When this shows up by itself, check the headers.
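
One way to catch this mistake before the server does is to capture a script's output and check the header block. The subroutine below is a minimal sketch; header_ok is a hypothetical name, and the test it applies (every line before the first blank line must look like Name: value, and one of them must be Content-type) is a simplification of what a server checks.

```perl
# Sketch only: header_ok is a hypothetical helper for command-line testing.
# Returns 1 if $output begins with header lines, including a Content-type
# line, that are followed by a blank line; returns 0 otherwise.
sub header_ok {
    my($output) = @_;
    my($head, $body) = split(/\r?\n\r?\n/, $output, 2);
    return 0 unless defined($body);                 # no blank line at all
    foreach my $line (split(/\r?\n/, $head)) {
        return 0 unless $line =~ /^[\w-]+:\s*\S/;   # not a header line
    }
    return ($head =~ /^Content-type:\s*\S+\/\S+/im) ? 1 : 0;
}
```

Applied to the output of the bad script shown earlier, header_ok returns 0, because the TITLE line appears where a header line or a blank line is expected.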

If command-line syntax checking has not been done and the script has a syntax error, usually these errors will show up in the error_log:

syntax error in file /web/httpd/cgi-bin/bourne/break_something.pl
at line 8, next 2 tokens "priint "GET form:""
Execution of /web/httpd/cgi-bin/bourne/break_something.pl
aborted due to compilation errors.
[Tue May 16 20:29:39 1995] httpd: malformed header from script

In this case, the print statement has a typo, which was duly reported in the error_log.

The server error logs might not provide enough information, though, or the developer might not have direct access to the logs.

In that case, my first suggestion is to test the gateway program on the host machine command line. Runtime data, such as values for environment variables, or stdin, will have to be provided. Supplying runtime data was described in the section "How the Server Passes the Data to the Gateway Program."

If the script runs without errors on the command line, but the output is still not what is expected, the problem might lie in how the gateway program is looking at the incoming data sent by the server or how the gateway program is outputting data. Developers might find it useful to generate their own log files-that is, to insert code into the gateway program to write input and output to temporary files. A basic technique in Perl is to create a dump file with code such as this:

open(DUMP, ">>my_debug_file.tmp") || die "cannot open dump file";

Then, you can write any variables that need to be examined:

read(STDIN, $input_line, $ENV{'CONTENT_LENGTH'});
print(DUMP "$input_line\n");

This is a very useful method of debugging; the full range of stdin, command-line arguments, and environment variables can be examined. In addition, a separate file for each transaction can be created by including the process ID in the file name. In Perl, this is $$:

open(DUMP, ">>my_debug_file.$$.tmp") || die "cannot open dump file";

This code creates a new file for each run of the script.
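
The dump-file technique can be collected into one subroutine that records both the environment and the request body. This is a minimal sketch; debug_dump, its arguments, and the layout of the dump file are hypothetical choices, and the directory passed in must be writable by the user the server runs as.

```perl
# Sketch only: debug_dump and the dump-file layout are hypothetical.
# Appends the CGI environment and the raw request body to a per-process
# file in $dir and returns the name of that file.
sub debug_dump {
    my($dir, $input) = @_;
    my($file) = "$dir/my_debug_file.$$.tmp";
    open(DUMP, ">>", $file) || die "cannot open dump file $file: $!";
    foreach my $key (sort keys %ENV) {
        print DUMP "$key=$ENV{$key}\n";      # one environment variable per line
    }
    print DUMP "--- stdin ---\n$input\n";    # the data read from stdin
    close(DUMP);
    return $file;
}
```

A gateway program would call &debug_dump("/tmp", $input_line) right after reading its input, and the call can be removed once the problem is found.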

Fetching the Contents of a URL: The http_get.c Program

Listing 20.17 shows the http_get.c code that I used in the Perl socket example shown in Figure 20.8. You'll see this program again, in Chapter 22's discussion of text search tools.


Listing 20.17. The http_get.c code.
/* http_get - fetch the contents of an http URL
**
** Originally based on a simple version by Al Globus <globus@nas.nasa.gov>.
** Debugged and prettified by Jef Poskanzer <jef@acme.com>.
*/

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

static char* argv0;

/* Gets the data at a URL and returns it.
** Caller is responsible for calling 'free' on returned *data.
** Returns -1 if something is wrong.
*/
long
getURLbyParts( void** data, char* machine, int port, char* file )
    {
    struct hostent *he;
    struct servent *se;
    struct protoent *pe;
    struct sockaddr_in sin;
    int sock;
    int bytes;
    char buf[10000];
    char* results;
    size_t size, maxsize;
    char getstring[2000];

    he = gethostbyname( machine );
    if ( he == (struct hostent*) 0 )
    {
    (void) fprintf( stderr, "%s: unknown host\n", argv0 );
    return -1;
    }
    se = getservbyname( "telnet", "tcp" );
    if ( se == (struct servent*) 0 )
    {
    (void) fprintf( stderr, "%s: unknown server\n", argv0 );
    return -1;
    }
    pe = getprotobyname( se->s_proto );
    if ( pe == (struct protoent*) 0 )
    {
    (void) fprintf( stderr, "%s: unknown protocol\n", argv0 );
    return -1;
    }
    bzero( (caddr_t) &sin, sizeof(sin) );
    sin.sin_family = he->h_addrtype;

    sock = socket( he->h_addrtype, SOCK_STREAM, pe->p_proto );
    if ( sock < 0 )
    {
    perror( "socket" );
    return -1;
    }

    if ( bind( sock, (struct sockaddr*) &sin, sizeof(sin) ) < 0 )
    {
    perror( "bind" );
    return -1;
    }
    bcopy( he->h_addr, &sin.sin_addr, he->h_length );

    sin.sin_port = htons( port );
    if ( connect( sock, (struct sockaddr*) &sin, sizeof(sin) ) < 0 )
    {
    perror( "connect" );
    return -1;
    }

    /* Send GET message to http. */
    sprintf( getstring, "GET %s\n", file );
    if ( write( sock, getstring, strlen( getstring ) ) != strlen( getstring ) )
    {
    perror( "write(GET)" );
        return -1;
        }

    /* Get data. */
    size = 0;
    maxsize = 10000;
    results = (char*) malloc( maxsize );
    if ( results == (char*) 0 )
        {
        (void) fprintf(
            stderr, "%s: failed mallocing %lu bytes\n",
            argv0, (unsigned long) maxsize );
        return -1;
        }
    for (;;)
        {
        bytes = read( sock, &results[size], maxsize - size );
        if ( bytes < 0 )
            {
            perror( "read" );
            return -1;
            }
        if ( bytes == 0 )
            break;
        size += bytes;
        /* Buffer is full: double it before reading more. */
        if ( size >= maxsize )
            {
            maxsize *= 2;
            results = (char*) realloc( (void*) results, maxsize );
            if ( results == (char*) 0 )
                {
                (void) fprintf(
                    stderr, "%s: failed reallocing %lu bytes\n",
                    argv0, (unsigned long) maxsize );
                return -1;
                }
            }
        }
    *data = (void*) results;
    return size;
    }


/* Get the data at a URL and return it.
** Caller is responsible for calling 'free' on returned *data.
** url must be of the form http://machine-name[:port]/file-name
** Returns -1 if something is wrong.
*/
long
getURL( void** data, char* url )
    {
    char* s;
    long size;
    char machine[2000];
    int machineLen;
    int port;
    char* file = 0;
    char* http = "http://";

    int httpLen = strlen( http );
    if ( url == (char*) 0 )
        {
    (void) fprintf( stderr, "%s: null URL\n", argv0 );
        return -1;
        }
    if ( strncmp( http, url, httpLen ) )
        {
    (void) fprintf( stderr, "%s: non-HTTP URL\n", argv0 );
        return -1;
        }

    /* Get the machine name. */
    for ( s = url + httpLen; *s != '\0' && *s != ':' && *s != '/'; ++s )
    ;
    machineLen = s - url;
    machineLen -= httpLen;
    strncpy( machine, url + httpLen, machineLen );
    machine[machineLen] = '\0';

    /* Get port number. */
    if ( *s == ':' )
    {
    port = atoi( ++s );
    while ( *s != '\0' && *s != '/' )
        ++s;
    }
    else
    port = 80;

    /* Get the file name. */
    if ( *s == '\0' )
    file = "/";
    else
    file = s;

    size = getURLbyParts( data, machine, port, file );
    return size;
    }


int
main( int argc, char** argv )
    {
    void* data;
    long size;

    argv0 = argv[0];
    if ( argc != 2 )
        {
    (void) fprintf( stderr, "usage:  %s URL\n", argv0 );
    exit( 1 );
        }

    size = getURL( &data, argv[1] );
    if ( size < 0 )
    exit( 1 );

    write( 1, data, size );

    exit( 0 );
    }
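
For comparison, the URL-splitting step that getURL performs can be written in a few lines of Perl. This is a minimal sketch rather than a translation of the C code: split_url is a hypothetical name, and it is stricter than getURL in that it rejects a port that is not made up of digits.

```perl
# Sketch only: split_url is a hypothetical Perl counterpart to getURL's parsing.
# The URL must have the form http://machine-name[:port]/file-name.
# Returns (machine, port, file), or an empty list if the URL does not match.
sub split_url {
    my($url) = @_;
    return () unless defined($url) &&
        $url =~ m{^http://([^:/]*)(?::(\d*))?(/.*)?\z};
    my($machine, $port, $file) = ($1, $2, $3);
    $port = 80  unless defined($port) && $port ne "";   # default HTTP port
    $file = "/" unless defined($file);                  # default document
    return ($machine, $port, $file);
}
```

As in the C program, a missing port defaults to 80 and a missing file name defaults to the root document.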

Gateway Programming Fundamentals Check


Footnotes

The NCSA httpd distribution includes the handy cgi-handlers.pl set of useful subroutines, available via anonymous FTP at ftp://ftp.ncsa.uiuc.edu/Web/httpd/Unix/ncsa_httpd/cgi/cgi_handlers.pl.Z. There is a similar package from Steve Brenner called cgi-lib.pl, and it is retrievable from http://www.bio.cam.ac.uk/web/cgi-lib.pl.txt.

On-line documentation describing environment variables is at http://hoohoo.ncsa.uiuc.edu/cgi/env.html.

On-line documentation describing server-side include techniques and available variables is located at http://www.webtools.org/counter/ssi/step-by-step.html and, more specific to the NCSA httpd server, http://hoohoo.ncsa.uiuc.edu/docs/tutorials/includes.html.

You can find the Un-CGI package at http://www.hyperion.com/~koreth/uncgi.html.

The Perl newsgroups (for example, comp.lang.perl.announce and comp.lang.perl.misc) have frequent guest appearances from Perl's author, Larry Wall.

Lynx, available from http://www.cc.ukans.edu/, is a text-only browser that, if the client can live without graphics, offers a quick and handy way to browse the Web.

Thanks to Aleksey Shaposhnikov for his programming labor on the Perl sockets application.

Lisa Ma, Peter Cheng, Peggy Liu, and Heshy Shayovitz worked with me to create the resumé application in the Spring 1995 Telecommunications class, Stern School of Business, Information Systems Department, New York University. Instructor: Professor Ajit Kambil.