Chapter 21 Gateway Programming I: Programming Libraries and Databases

Rules of Thumb in Gateway Programming
Perl Programming Libraries
A Perl Interface to File Transfer Protocol (FTP)
An Overview of the Relational Database Model
A Double Binary Search with a Little State Thrown In
Final Binary Search Example: The Partial Company-to-Ticker-to-Filing Application
Perl Version 5
Perl and Relational Database Packages
Illustra: An Unusual RDBMS
Pros and Cons of Relational Databases on the Web
Additional Code
Gateway Programming Libraries and Databases Check

The choice of Perl as a gateway scripting language is a fortuitous one for several reasons:

Many Web server packages include Perl routines that can make the Web developer's job easier. As I will show, support for Web-integrated Wide-Area Information Search (WAIS) full-text searching is built into the NCSA httpd distribution.
Perl is well documented and has a set of active Usenet newsgroups (comp.lang.perl.*), where Perl author Larry Wall holds court. Tom Christiansen also maintains an excellent Web site (http://mox.perl.com). I also maintain a site (http://edgar.stern.nyu.edu/adv-soft) dedicated to Perl gateway programming, which includes lecture notes, class projects, advanced topics, and more. Randall Schwartz's excellent Unix Review articles are available on-line at http://www.stonehenge.com/~merlyn/UnixReview/.
New software initiatives often are accompanied by auxiliary Perl gateway routines written as a courtesy to the Internet community by the author(s) or by other contributors. The transition to Perl 5, for example, was made easier by Lincoln Stein's important CGI.pm and GD.pm Perl 5 modules.
It often is possible to mix and match Perl library routines to suit a specific need, even when the original intent of the routine is far removed from the current business domain. In the section "Perl Programming Libraries," I'll show an example; in my SEC EDGAR Filings project work, I found a set of Perl routines originally intended for astronomical calculations and used them to implement a date search.

Rules of Thumb in Gateway Programming

Suppose that I, a Web developer, have been charged with the task of constructing a program or set of programs that uses a third-party software package. The package could be a database, a text indexing or retrieval tool, or an image-manipulation package-in short, any one of myriad tools that a Web site might covet. Are there a general set of guidelines to help me approach the problem of software integration in the Web environment? Here is one reasonable approach to the problem:

First, follow the ancient wisdom of reading the manual. At a minimum, read the command-line syntax guidelines and any README files that came with the software distribution.
Experiment with the software to be integrated on the command line. Web developers should make absolutely sure that they understand its behavior under a large range of conditions (input data, system load, or other running processes that might have an interaction effect).
Search the Internet for toolkits that already have been developed by the software authors or by other interested parties. If I wind up reinventing the wheel, it's very likely that I'll have an inferior product, and it's quite certain that I will have wasted a lot of time. Use basic Internet search facilities such as Archie, or Web search facilities such as InfoSeek or Excite.(See note)
Search the Internet for Usenet newsgroups-in particular, frequently asked questions (FAQs) or on-line manuals that might address the proposed integration. Suppose that I want to use Perl to integrate a Sybase database engine at my Web site. Imagine my joy when I discover an on-line manual on this topic-actually, an Internet Draft.(See note) AltaVista (http://altavista.digital.com) and DejaNews (http://www.dejanews.com) offer two excellent Web interfaces to a Usenet search.

Perl Programming Libraries

Web developers who use Perl as a scripting language should be aware of the standard Perl library routines. For example, look.pl implements a binary search, as I'll demonstrate shortly. It is also highly recommended for the Web developer to build global subroutine libraries that the whole installation can share. The NYU EDGAR site, for example, uses the simple routines described in the following sections.

`form_init`

The subroutine form_init shown in Listing 21.1 builds an associative array; the more common filing types are the key values. Thanks to Jim Risser for his assistance on this.

Listing 21.1. The form_init subroutine.

sub form_init { %forms = ( '10-K', 'Annual Report', '10-K/A', 'Ann. Rpt. Amendment', '10-Q', 'Quarterly Report', '10-Q/A', 'Qtr. Rpt. Amendment', '8-K', 'Current Event', '8-K/A', 'Curr. Event Amend.', 'SC 13D', '>= 5% Acquisition', 'N-2', 'Closed-End Fund', 'N-1A','Open-End Fund', 'SC 13D/A', 'Acquis. Amendment', 'NSAR-A', 'Semi-annual Fund Rpt.', 'NSAR-B', 'Semi-annual Fund Rpt.', '13F-E', 'Mutual Fund Holding', '485', 'Post-Eff. Fund Prospectus', '424B2', 'Prospectus Supplement', '424B1', 'Prospectus Supplement', '424B5', 'Prospectus Supplement', '485BPOS', 'Post-Eff. Fnd Pspcts.', '485APOS', 'Post-Eff. Fnd Pspcts.', 'DEF 14A', 'Definitive Proxy', 'S-3', 'Stock/Bond Regis.', 'S-3/A', 'Stock/Bond Reg. Amnd.', 'SC 13G', '>= 5% Acquisition', 'SC 13G/A', 'Acquis. Amendment', 'S-8', 'ESOP', '11-K','ESOP Ann. Rpt.', '497','Fund Prospectus', '497J','Fund Prospectus', 'PRE 14A','Prelim. Proxy', ); }

I can access the filing descriptions quickly from any program that includes this little script just by saying something like

$desc = $forms{'10-Q/A'};

`nyu_trailer`

The nyu_trailer subroutine shown in Listing 21.2 outputs a nicely formatted trailer at the bottom of standard reports showing a timestamp and the affiliation credits.

Listing 21.2. The nyu_trailer subroutine.

sub nyu_trailer{ local($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime; local(@days) = ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'); local(@months) = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'); print "<p>\nGenerated by the <a href=\"http://is-2.stern.nyu.edu/\">"; print "<B>NYU Stern School</B> of Business</a>"; print " on ", sprintf("%02d:%02d:%02d on %s %d, %d", $hour, $min, $sec,$months[$mon], $mday, 1900+$year); print "</body></html>\n"; }

`html_header`

The html_header subroutine shown in Listing 21.3 is a handy tool to ensure a proper MIME header as the first output of a CGI script. It is available in certain modified versions of the cgi-lib.pl, which is distributed with NCSA httpd server software; note that it takes an argument: the HTML <title> of the returned document.

Listing 21.3. The html_header subroutine.

# # html_header sends an HTML header for the document to be returned # sub html_header { local($title) = @_; print "Content-type: text/html\n\n"; print "<html><head>\n"; print "<title>$title</title>\n"; print "</head>\n<body>\n"; }

`home`

This small subroutine, which is shown in Listing 21.4, is actually part of our subroutine collection edgarlib, and is handy to display after a generic CGI script finishes outputting its information. home supplies default text and a default GIF to allow the user to get back to the EDGAR home page; both the text and the GIF can be overridden by calling the CGI script.

Listing 21.4. The home subroutine.

sub home { local ($gif,$text) = @_; # if nothing supplied, set default gif and text if ($#_ < 0) { $gif = "back.gif"; $text = "Return to our home page"' } print "<HR>"; print "<a href=\"http://edgar.stern.nyu.edu/\">"; print "<img src = \"http://edgar.stern.nyu.edu/icons/$gif\">"; print "$text </a>"; print "<hr>"; }

Tip

The Web developer should build a global library of useful subroutines and use them as needed. Having a shareable subroutine local to a script is wrong for three major reasons:

Other developers might not know about your code and will be forced to write their own similar routines, thus wasting time and energy.

Local subroutines can unnecessarily lengthen scripts and make them harder to read and maintain.

Having global subroutines in a standard library directory encourages developers to enhance them or spin them off for related business purposes.

Aside from these pleasant time-saving utilities, there often are more specialized needs that subroutines can handle nicely. The EDGAR Internet project, for example, has a need to convert a loosely phrased date constraint such as "give me all the SEC filings from six months ago until now for 3COM Corporation" to a more concrete numeric form for proper processing.(See note)

In order to perform date conversion, I searched the Internet Perl archives and located a date package that originally was intended for astronomers! With some minor modifications, it can be plugged right in, as shown in Listing 21.5.

Listing 21.5. The code for edgardate.

#--------------------------------------# # Edgar Date conversions. # #--------------------------------------# # These functions are copied from # "Practical astronomy with your calculator" # Peter Duffett-Smith # Cambridge University Press # Second Edition # # EDGAR Modification: don't allow fractional answers, round and take integers. # package date; # Convert date to julian day number. sub main'dtoj { ($d, $m, $y) = split('-', $_[0]); if ($m == 1 || $m == 2) { $y-; $m += 12; } if ($y > 1582 || ($y == 1582 && $m > 10) || ($y == 1582 && $m == 10 && $d > 15)) { $A = int $y / 100; $B = 2 - $A + int $A / 4; } $C = int 365.25 * $y; $D = int 30.6001 * ($m + 1); $B + $C + $D + $d + 1720994; # Here, truncate the annoying 0.5 at end } # Convert julian day number to date. sub main'jtod { $_[0] += 0.5; $I = int $_[0]; $F = $_[0] - $I; if ($I > 2299160) { $A = int (($I - 1867216.25) / 36524.25); $B = $I + 1 + $A - int $A / 4; } else { $B = $I; } $C = $B + 1524; $D = int (($C - 122.1) / 365.25); $E = int 365.25 * $D; $G = int (($C - $E) / 30.6001); $d = $C - $E + $F - int 30.6001 * $G; $d = int($d); # EDGAR mod: stop annoying fractional answers. $m = $G < 13.5 ? $G - 1 : $G - 13; $y = $m > 2.5 ? $D - 4716 : $D - 4715; "$d-$m-$y"; } # Convert date to day of week (0 = Sun 6 = Sat). sub main'd2w { $A = (&main'dtoj($_[0]) + 1.5) / 7; $X = ($A - int $A) * 7; $Y = $X - int $X; if ($Y > .5) { $X = int $X + 1; } elsif ($X > 0) { $X = int $X; } $X; } 1;

With the help of edgardate, there now is a path to convert the phrase "six months ago" to a specific number-and then arithmetic can be done on that number. The answer then is converted back to a typical date form (day-month-year) for output; or a Perl routine easily can create a new date form if necessary.

I will show how edgardate is used in the practical business of retrieving Securities and Exchange Commission corporate filings in the section "Binary Search Example II: A Corporate Filings Lookup."

A Perl Interface to File Transfer Protocol (FTP)

Perl libraries make it easy to automate within a script an FTP session. The standard Perl library ftplib.pl can be used; the EDGAR team has enhanced this in order to provide useful error routines in a new package (edgarftp.pl), which is located in the "Additional Code" section of this chapter. All the standard FTP commands (logging on, changing directories, listing files, getting files, and logging out) can be scripted. This is useful particularly to automate a data feed from an FTP server.

Caution

SunOS (BSD-style UNIX) and the System V-style UNIX-for example, Solaris 2.X or Silicon Graphics' Irix 5.x-have different socket implementations. To use an FTP interface in Solaris or Irix, it is necessary to have the file socket.ph. This can be created by the Perl utility h2ph: h2ph < socket.h > socket.ph.

The simple example shown in Listing 21.6 logs onto an anonymous FTP server at www.sec.gov, changes to a known directory, and then FTPs a known file to my machine.

Listing 21.6. Using Perl to automate an FTP session.

#!/is-too/local/bin/perl # # ftp.pl: a Perl FTP interface that does the following: # # 1) Log into FTP site www.sec.gov # as ftp (anonymous) with password email@address.nyu; # # 2) cd /edgar/daily-index/1996/QTR2 # # 3) get a file # # 3) Send e-mail to mark: "File *** is ready.", where *** is the name of file. # # 4) send error e-mail if: # a) can't log onto the host; # b) can't cd; # c) can't find file. # ###################################################################### # require libraries require '/usr/local/lib/perl/ftplib.pl'; require '/usr/local/lib/perl/edgarlib'; require '/usr/local/lib/perl/edgarftp.pl'; #################################################### # initialize variables $mail_address = "mark\@edgar.stern.nyu.edu"; $edftp'address = "ftp.sec.gov"; $edftp'userid = "ftp"; $edftp'password = "kirill\@edgar.stern.nyu.edu"; $edftp'directory = "/edgar/daily-index/1996/QTR2"; # where files live $edftp'filename = "company.960628.idx"; $log = "/tmp/log.$$"; ################################# main ######################## main: { open (LOG, ">$log") || die "Could not open $log"; open (MAIL, "|mail $mail_address") || die "Couldn't send mail!"; $d = 'date'; print LOG "\nLog started on $d"; &edftp'logon_to_server; &check_errors; &edftp'goto_directory; # cd to $edftp'directory &check_errors; &edftp'ftp_the_file; &check_errors; # Send mail message to $mail_address about the successful creation of file. print MAIL "File $edftp'filename is ready\!\!\!\n"; } ################### # subroutines # ################### sub check_errors { $message = &ftp'error; if ($message) { print MAIL "\nError message: $message\n"; print "\nError message: $message\n"; print LOG "\nError message: $message\n"; close(MAIL); &edftp'logout; print LOG "\nEmergency logout! All handlers closed!\n"; close(LOG); exit 1; } } # end subroutine

This script places the file company.960628.idx in the directory from which the script is executed. For nonanonymous logon, it is a simple matter of changing the user ID and password. Note how the routine &check_errors is called after each FTP action; it is better to be safe than sorry when running an automated FTP script.

Tip

If the file(s) to be FTP'd are not known ahead of time, the script easily can be modified to produce a directory listing and fill an array. The download then can set up a loop to pick up as many files as desired. Similarly, directory structures can be explored programmatically if they cannot be predicted before the logon.

An Overview of the Relational Database Model

The relational data model is a simple yet powerful way to represent data as tables: the columns are fields of the database, and the rows are records. Flat file records easily can be divided by the Perl split command into its component fields. If the tables are related by common key fields, they can be joined to perform a query on multiple tables. Furthermore, commercial engines allow multiple key fields in the record to speed retrieval of queries based on those fields.

The web developer should be prepared to invest significant time to preprocess a data store in order to efficiently organize the data for later query. Efficient queries on a single key field can be performed by Perl or by a third-party database package; both depend on a reasonable organization of the underlying data, however. If data is being supplied to a site by an outside source (such as stock quote data or newsfeed data), it is important to analyze the source in terms of format. Is it regular? Are there headers that must be dispensed with? Are there occasional exceptions to the expected format?

If such questions are answered up front, unpleasant surprises can be avoided. The dangers of database contamination increase when an automated pipe has been set up (such as an automated e-mail handler) between the data feed provider and the Web site.

In the following sections, I review Perl techniques to interface effectively with tabular data. I then move onto a discussion of interface design with commercial database packages, such as Sybase and Oracle, and conclude with real-life code using the NYU EDGAR site's relational database engine, Illustra.

Binary Searches

The standard Perl library routine, look.pl, is a very important tool for web developers. If the site has a large text file, and that file is sorted on a key field, the look.pl routine implements a binary search. Binary searches are the method of choice in a Web application that queries a large text file-the guiding principle being, of course, to avoid having the user stare at a busy cursor.

A binary search on a one-million-record file, for example, takes only a maximum of 20 iterations before the correct record is found.(See note)If the application did not take advantage of a binary search and processed records sequentially in sorted order, a query starting with yyy might have to go through 990,000 records before returning output!

Binary Search Example I: A Simple Rolodex Application

I will start with an application that asks the user to enter all or part of a last name and then returns the matching name(s), along with the e-mail address and full name. The dataset for this application, rolodex.dat, is very simple. Here are some sample records:

AALA maala Maria Aala AARON maaron Marvin Aaron ABAD cabad Charmaine Abad ABAD cabad0 Christina M Abad ABAD rabad Roderick P Abad ABAS mabas Muhamad-Hafiz Abas ABAZIS cabazis Constantin Abazis

The field layout follows, from left to right: the last name, in capital letters; the user ID on the computer system; and, finally, the full name of the user. The fields are separated by two or more spaces. Note that the third field contains an embedded space, so it is important in the Perl parsing routine to split the record properly into three fields and not artificially separate the full name. Also note that if a file is not sorted by the key (the search field that is the input to the HTML form), it should be preprocessed; the key field should be flush left on the record, and the file should be sorted ascending by the key. For more complex searches, such as multiple key field queries and multiple data tables, a formal relational database engine is more appropriate.

Figure 21.1 shows the initial screen of the Rolodex application, rolodex.html. The user asks for all names starting with the characters aa.

Figure 21.1 : The Rolodex application starting point.

Listing 21.7 shows the HTML source code for rolodex.html.

Listing 21.7. The HTML source code for rolodex.html.

<TITLE>Rolodex Application </TITLE> <H2>Rolodex Application</h2> <ol> <li>Last Name....: <b> Required. </b> Enter the first few letters. <br> <li>Hit Limit: <i> Optional </i> Default is the first 100 hits. </ol> <p> <FORM METHOD="POST" ACTION="http://edgar.stern.nyu.edu/mgbin/rolodex.pl"> <b>Last Name:</b> <INPUT NAME="name"> <P> <b>Set Hit Limit? </b> <SELECT NAME="limit-List"> <OPTION> no limit - I can sit here all day! <OPTION> 1000 hits <OPTION> 500 hits <OPTION> 250 hits <OPTION SELECTED> 100 hits <OPTION> 50 hits <OPTION> 25 hits <OPTION> 10 hits <OPTION> 5 hits <OPTION> 1 hit </SELECT> <P> <b> Debugging? </b> <SELECT NAME="debug"> <OPTION> debug - my browser is giving me flaky results! <OPTION SELECTED> no debug - smooth sailing! </SELECT> <P> Submit choices: <INPUT TYPE="submit" VALUE="Retrieve Addresses">. Reset form: <INPUT TYPE="reset" VALUE="Reset">. </FORM> <HR> <A HREF="http://edgar.stern.nyu.edu/EDGAR.html"> <img src="http://edgar.stern.nyu.edu/icons/back.gif"> Return to our Home Page.</A>

Note that I define a default of 100 hits (name matches), and I allow the user to set debugging on, which echoes the form variables on top of the result screen.

The result of the query launched in Figure 21.1 is shown in Figure 21.2. Two names are found starting with aa.

Figure 21.2 : A response from the Rolodex CGI script.

Having shown the dataset, the HTML, and the screen snapshots of the Rolodex before and after, it is time now to review the Perl CGI code, rolodex.pl, as shown in Listing 21.8.

Listing 21.8. The Perl CGI code, rolodex.pl.

#!/usr/local/bin/perl # # rolodex.pl # require 'edgarlib'; # this has the &home routine in it require 'look.pl'; # Required for binary search require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl'; &html_header("Rolodex"); # from our (modified) cgi-lib.pl # Get the input &parse_request; # from the (standard) cgi-lib.pl # # # Now we have an associative array of Form Variables and # values, %query. # # If debugging on, show all the keys and values from the form. # if ($query{'debug'} =~ /no debug/) { } else{ &show_debug_info; } # # If name missing, let them retry the form. # if ($query{'name'} eq "") { print "<h1>Error! It seems you did not enter a last name.</h1><p>"; print "<A HREF=\"http://edgar.stern.nyu.edu/mgtest/rolodex.html\">"; print "Try Again! </A> <hr>"; &home; # let them go back to edgar home if they want to. exit 1; } $path = "/usr/users/mark "; # change this for your system $rolodex = "$path/rolodex.dat"; &field_head(); # for the Results columns. print"<HR>"; $hitctr = 0; # hit counter variable open(ROLODEX, $rolodex) || die "cannot open $rolodex data file"; # # &look(*ROLODEX, $query{'name'},0,1); # use the assoc array while (<ROLODEX>){ last unless /^$query{'name'}/i; # if move beyond the match, # exit this loop immediately @line = split(/\s\s+/); $hitctr++; if ($hitctr > $query{'limit-List'}) { $hitctr-; # must adjust this to get it right. print "<i>User limit of $hitctr reached...ending search.</i>"; last; } print "<pre>"; printf(" %-20s %-15s %-30s",$line[0],$line[1],$line[2]); print"</pre>"; } # end of WHILE close(ROLODEX) || die "cannot close $rolodex data file"; print "<hr>"; print "Your search had <b>$hitctr</b> hit(s).<p>"; print "<A HREF=\"http://edgar.stern.nyu.edu/mgtest/rolodex.html\"> Next</A> Rolodex search!"; # &nyu_trailer; # present credits, and timestamp &home; # little subroutine to show links to 'go home' exit 0; # # field_head: line up the output columns # sub field_head{ $fhdr="<B>Last Name</B>"; $chdr="<B>E-Mail</B>"; $shdr="<B>Full Name</B>"; print "<pre>"; printf(" %-25s %-20s %-30s ",$fhdr, $chdr, $shdr); print "</pre>"; } # # debug_info: if user selects the debug option, echo the # form keys and values before the query output. # sub show_debug_info { while (($key,$value) = each(%query)) { print "The value of $key is $value <br>"; } } exit 0;

The rolodex.pl program implements a binary (in other words, an extremely fast) search on the rolodex.dat data file. In particular, note the &look routine that positions a pointer on the first record in the sorted data file, if such a record matches the user's input. Then, the while loop processes all records that continue to match the input-aa, in this example. The last; statement executes after the pointer moves past the names starting with aa-thus, both the time to find the first matching record and the time of program execution are minimized.

The remainder of the code is concerned with such niceties as checking to make sure the user did not leave the name field blank, formatting the result columns, and HTMLizing the results page above and below the data itself. The Stern School of Business uses the Rolodex application, which was embellished by Alan Eisner; you can find it on-line at http://www.stern.nyu.edu/~aeisner/rolodex.html.

Before I leave the Rolodex application, I present the wrong way to do things in Listing 21.9; rolo_bad.pl is instructional because it shows how not to search a flat file. The code is identical to rolodex.pl except for the logic inside the while loop and the removal of the look call just before the while loop.

Listing 21.9. How not to do things-the rolo_bad.pl code.

#!/usr/local/bin # rolo_bad.pl: code excerpt # # code deleted.... # while (<ROLODEX>){ @line = split(/\s\s+/); # split on two or more spaces if ($line[0] =~ /^$query{'name'}/i) { $hitctr++; if ($hitctr > $query{'limit-List'}) { $hitctr-; # must adjust this to get it right. print "<i>User limit of $hitctr reached...ending search.</i>"; last; } print "<pre>"; printf(" %-20s %-15s %-30s",$line[0],$line[1],$line[2]); print"</pre>"; } # end of first if elsif (($hitctr > 0) && ($line[0] !=~ /^$query{'name'}/i)) { last; # get out, if went too far } } # end of While close(ROLODEX) || die "cannot close $rolodex data file"; # # remainder of code deleted. . . #

By not using the look routine to accomplish a binary search, the rolo_bad.pl code reads the data file record by record until it finds the first occurrence of a matching name (the first field of the record, flush left). It then processes that record and outputs the results. At least the program is smart enough to drop out of the loop if it detects that it already has found one or more matches and that it is beyond the eligible section (this is, the compound if statement and the last; command). The problem, of course, is that if the data file is very large and the user happens to enter a letter toward the end of the alphabet (for example, x), a lot of unnecessary disk reads will occur before the first match is found.

Tip

If flat files are to be used as inputs to common CGI queries, spend the time to preprocess these datasets (sort them by key field). Make sure that the internal logic of the CGI search routine maximizes performance by minimizing the number of file reads performed.

In command-line Perl search applications, large keyed flat files often are loaded into associative arrays (key/value pairs). The one-time penalty of the array load is compensated by very fast retrieval in subsequent queries (because the data is now in memory), assuming that the user stays in the program. The problem of doing so in the current Web environment is the statelessness of the HTTP connection; after the client/server connection is dropped, the server loses the memory of the previous request and the CGI program would have to load the array again. If a long-lived session ID (such as a state) is enabled in future versions of HTTP, an up-front associative array load should be seriously considered for eligible CGI query applications. A large associative array load should not be attempted on a machine that is low on RAM, however. If memory is a constraint, a binary search from disk will have to suffice.

Binary Search Example II: A Corporate Filings Lookup

I now move onto the more complex problem that again uses the principle of fast binary search, using the look.pl routine.

The application is a search of the massive SEC EDGAR corporate filings index, which, as of June 1996, is in excess of 45MB (not to be confused with the filings data itself, which is in excess of 40GB).

Here are three sample records of the company.sorted index file:

EQUITY SECURITIES TRUST SR 3 SIGNAT SR GABELLI COMM INCOME T 497J 905265 19940627 edgar/data/905265/0000903112-94-000628.txt GABELLI ASSET FUND 485BPOS 783898 19950428 edgar/data/783898/0000891554-95-000049.txt GABELLI ASSET FUND 497 783898 19950505 edgar/data/783898/0000891554-95-000053.txt

The company.sorted file has been presorted by the company name (the far-left field). The other fields, from left to right, are the filing type (497J, 485BPOS, and 497, in this example), the filing date (in YYYYMMDD format), and the full physical path of the actual filing data on the SEC Government machine (ftp://www.sec.gov).

Given the immense size of the company.sorted index file (it's more than 40MB and has the scary feature of growing every business day), it would be foolhardy to avoid using a binary search.

Tip

Don't pound a square peg into a round hole by using a binary search as a universal search mechanism. Often, users need fast access by multiple keyed fields. In such cases, a conventional relational database should be favored (such as Sybase, Oracle, MS-SQL, and so on) over the Perl techniques presented in this section. Sybase has a web.sql product, for example, to act as the glue between the HTML form and the SQL back end. In a separate development, methods are being explored to bypass the common gateway interface completely to satisfy Web database access needs. Although it is outside the scope of this book, one of the most interesting methods is Sun's Java-based JDBC initiative; the mid-1996 API is described at http://splash.javasoft.com/jdbc/.

Figure 21.3 shows the search input screen. The user enters FMR as the company name; selects 13F-E as the filing type (Mutual Fund quarterly holding report), selects 100 as the maximum answers to be returned, selects the last six months (from the current date) as the date range, and declines the debugging option (which would display raw query data to the screen and is used to identify and help users with problematic Web browsers). Then it's up to my script to return an answer efficiently. This EDGAR application is on-line at http://edgar.stern.nyu.edu/formlynx.html.

Figure 21.3 : A query on FMR's SEC disclosure documents.

Before I walk through the code, you might find it helpful to see Figure 21.4-the answer to the FMR query entered in Figure 21.3.

Figure 21.4 : Three hits are found for FMR 13F-E filings that were filed within the last six months.

To complete the normal cycle of this application, if the user clicks the first FMR hotlink shown in Figure 21.4, for example, an FTP request to town.hall.org is kicked off and the raw SEC filing is returned. Figure 21.5 shows the first page of this document. Note the digital security signature that is placed on every filing for authentication purposes. The FTP hotlinks in Figure 21.4 were formed by the same Perl program that accomplished the binary index search. It's time to look at that code now; see Listing 21.10.

Figure 21.5 : The SEC server returns an FMR disclosture document.

Listing 21.10. The code for formlynx.pl.b.

#!/usr/local/bin/perl # ----------------------------------------------------------- # formlynx.pl.b Supports formlynx.html # # Piotr Kurdusiewicz # # Maintenance Log # -------------- # 8/14/94 MG : changed URL "file" to "ftp" # 1/10/95 MG: log the form and the company # # 4/12/95 PL: Use associative array form input. sub match{ require 'edgarlib'; require 'edgardate'; # Julian date routines require 'look.pl'; # Required for binary search require 'ctime.pl'; $Date=&ctime(time); # nice formatting of date and time from the timestamp. $sdate=chop($Date); # require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl'; &html_header("Form and Company Results"); # from cgi-lib.pl $julnum = &j_number(); # Change current date to Julian read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'}); @pairs = split(/&/, $buffer); #Split the name-value pairs foreach (@pairs) { ($key, $value) = split (/=/, $_); $value=&deweb($value); $form{$key} = $value; # # If user inputs a Filing Type not on the picklist, use it. Else, # use the Filing Type selected on the picklist. # if ($form{'filing-override'} eq "") { ($userform, @garbage) = split (/\s/, $form{'Form-Pick-List'});} else { $userform = $form{'filing-override'}; } } # # Not allowed to do this query without a company input. # if ($form{'company'} eq "") { print "<h1>Error! It seems you did not enter a company name.</h1><p>"; print "<A HREF=\"http://edgar.stern.nyu.edu/formlynx.html\">"; print "Try Again! </A>"; print "<hr>"; print "However, if you did enter a company, "; print "retry this program with the debugging flag set on "; print "and report your results to us."; print "<hr>"; print "<a href=\"http://edgar.stern.nyu.edu/\">"; print "<img src=\"http://edgar.stern.nyu.edu/icons/back.gif\">"; print "Return to the Edgar Home Page </A>"; exit 1; } # # Do date arithmetic. Convert the date range to the Julian numbers. # if($form{'date-range'} eq "Last Week"){ $matchdate=&main'jtod($julnum-7); }elsif($form{'date-range'} eq "Last Two Weeks"){ $matchdate=&main'jtod($julnum-15); }elsif($form{'date-range'} eq "Last Month"){ $matchdate=&main'jtod($julnum-30); }elsif($form{'date-range'} eq "Last Six Months"){ $matchdate=&main'jtod($julnum-180); }elsif($form{'date-range'} eq "Last Year"){ $matchdate=&main'jtod($julnum-360); } else { $matchdate=&main'jtod(2449354); # this number is 1/1/94 (beginning of # Edgar Internet Project) } ($nday, $nmon, $nyear)=split(/-/,$matchdate); if($nmon < 10){ $nmon=join('','0',$nmon); } if($nday< 10){ $nday = join('','0',$nday); } $matchdate=join('',$nyear,$nmon,$nday); $forms = "/usr/local/edgar/web/docs/company.sorted"; # the big index $logpath="/web/profile/logs/"; $logname="formlynx.log"; # keep log. $logfile = ">>$logpath$logname"; &aux_vars; # for max_hit counter, and debugging info on/off. &field_head(); # for the Results columns. print"<HR>"; $hitctr = 0; # hit counter variable open(COMPANY, $forms) || die "cannot open the INPUT FILE"; &look(*COMPANY, $form{'company'},1,1); # do the binary search open(LOGFILE, $logfile) || die "problem opening Log file"; # # Log appropriate info. # print LOGFILE "$Date | $ENV{REMOTE_HOST} | $ENV{REMOTE_ADDR} | $ENV{HTTP_USER_AGENT} | $form{'company'} | $userform | $matchdate \n"; close LOGFILE || die "problem closing logfile\n"; while (<COMPANY>){ last unless /^$form{'company'}.*/i; @line = split(/\s\s+/); #print "<br>Line is @line\n"; #print "<br>User form is $userform"; if (($line[1] =~ /$userform/i || $userform eq "ALL")&& ($line[3] >= $matchdate)) $company = "<A HREF=ftp://town.hall.org/$line[4]>$line[0]</A>"; @date = split(//,$line[3]); $date = "$date[4]$date[5]-$date[6]$date[7]- $date[0]$date[1]$date[2]$date[3]"; $hitctr++; if ($hitctr > $ulimit[1]) { # stop if maximum hits exceeded. $hitctr-; last; } print "<pre>"; printf(" %s %-10s %-6s %30s",$date,$line[1],$line[2],$company); print"</pre>"; } # end of IF } # end of WHILE close(COMPANY) || die "cannot close the INPUT FILE"; # this is critical to # reset the line pointer; # else erratic. print "<hr>"; $hnew = $hitctr; print "Your search had <b>$hnew</b> hit(s).<p>"; print "<A HREF=\"http://edgar.stern.nyu.edu/formlynx.html\"> Next</A> Form search!"; # &nyu_trailer; # timestamp the report exit 1; } #end of "eval &match" sub field_head{ $fhdr="<B>FORMS</B>"; $chdr="<B>COMPANY NAME</B>"; $shdr="<B>CIK CODE</B>"; $dhdr="<B>DATE FILED</B>"; print "<pre>"; printf(" %-10s %-10s %-10s %s",$dhdr, $fhdr, $shdr, $chdr); print "</pre>"; } sub j_number{ local($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime; $smon=$mon+1; $syear=$year+1900; $timedate=join('-',$mday,$smon,$syear); &main'dtoj($timedate); } sub debug { print "<pre>"; print "Debugging Information \n"; print "*----------------------------------* \n"; #print "The user limit is $ulimit[1] \n"; print "User Limit is $ulimit[1] \n"; $i = 0; print "pair 0 is $pairs[0] \n"; for ($i,$#pairs,$i++) { print "pair $i is $pairs[$i] \n"; $i++; } print "*----------------------------------* \n"; print "</pre>"; } sub aux_vars { @ulimit = split(/=/,$pairs[4]); @udebug = split(/=/,$pairs[5]); $ulimit[1]=~y/+/ /; $ulimit[1]=~y/a-z/A-Z/; $udebug[1]=~y/+/ /; $udebug[1]=~y/a-z/A-Z/; if ($ulimit[1]=~ /^no/i) { $ulimit[1] = 99999; } # no limit means No Limit. if ($udebug[1]=~ /^debug/i) { &debug; } } eval '&match'; exit 0;

Observations about `formlynx.pl.b`

The code line

look(*COMPANY, $form{'company'},0,1);

is extremely important. The binary search is looking up the key entered by the user-in this case, the company name-on the file associated with the file handle COMPANY (a 45MB file).

The two numeric parameters 0 and 1 are the $dict and the $fold flags. From the Perl syntax manual, a $dict flag of 0 causes all characters in the company.sorted file to participate in the lookup. If the $dict flag was 1, only letters (A-Z), numbers (0-9), and blanks would participate. Because the company.sorted file contains companies such as G&K SERVICES INC, it is actually quite wrong to set $dict to 1 in this case (I know from bitter experience). Interestingly, this error has the effect of rendering companies immediately following the problematic entry-for example, GABELLI ASSET FUND-invisible (unlocatable by the erroneous binary search).

If $dict is set to 0, however, all is well and the & character participates in the binary lookup. The $fold flag, if nonzero, converts uppercase (A-Z) to lowercase (a-z) during the comparison. This is properly set to 1 in this example; the user is likely to enter a lowercase company input.

A few lines later, the program enters a while loop; all the output assembly occurs here and some valuable techniques are illustrated. Listing 21.11 shows the code fragment from formlynx.pl.b.

Listing 21.11. The code fragment from formlynx.pl.b.

while (<COMPANY>){ last unless /^$form{'company'}.*/i; @line = split(/\s\s+/); #print "<br>Line is @line\n"; #print "<br>User form is $userform"; if (($line[1] =~ /$userform/i || $userform eq "ALL")&&($line[3] >= $matchdate)) { $company = "<A HREF=ftp://town.hall.org/$line[4]>$line[0]</A>"; @date = split(//,$line[3]); $date = "$date[4]$date[5]-$date[6]$date[7]- $date[0]$date[1]$date[2]$date[3]"; $hitctr++; if ($hitctr > $ulimit[1]) { $hitctr-; last; } print "<pre>"; printf(" %s %-10s %-6s %30s",$date,$line[1],$line[2],$company); print"</pre>"; } # end of IF } # end of WHILE

Note that the "look" statement merely positioned us at the first record in the index file matching the company key. Now, the while loop must process all such matching records. The line

last unless /^$form{'company'}.*/i;

does just that. It says, "if I am still positioned on the key value, keep processing (in this case, shaping the hyperlinks and formatting the output), or else just get out; I'm done." The /i qualifier means a case-insensitive search. Recall that the company.index company names are in uppercase and the user entry might well be lowercase (alternatively, the developer can use Perl to translate lower- to uppercase).

The matching records are processed and the proper HTML is wrapped around the company name field to provide an FTP hotlink. Refer again to Figure 21.4 for the screen snapshot of the answer.

Note also the necessity of outputting a <pre> tag to prepare the output for some field formatting. If I omit the <pre> tag, the printf statement will not have its desired effect and the output fields won't line up. Having seen <pre>, HTML watchdogs immediately will look for a </pre> tag; I supply one when the output loop is finished. Another way of formatting output in Perl is the FORMAT statement, which I demonstrate in the section "Perl and Relational Database Packages." The <pre> tag also is necessary in that case.

A variation on the theme of wrapping Perl hypertext links around text output is the use of system commands, such as a sort, before presenting the final output.

Figure 21.6 shows the identical application with one twist; the filings returned now are sorted in date order, with the most recent filings first.

Figure 21.6 : The filings retrieved now have been presorted in date order, with the most recent filings first. This answer corresponds to companies starting with AP and forms containing 10-K.

In this example, the user filled out the form shown in Figure 21.3, with company equal to AP and form equal to 10-K.

The sort clearly adds value to the application, and it's up to the web developer to integrate properly and efficiently the sorting process before the output is presented. Instead of showing the code here (because it is largely similar to formlynx.pl.b), I defer the code listing to the "Additional Code" section at the end of the chapter (Listing 21.21, the sortform.pl. code).

One technical note: It's possible to sort the file to disk and then process it in Perl for Web consumption, or, equivalently, to open an indirect file handle in Perl and pipe the output of the sort command to this file handle. The former method is better in the debugging stage because it is very handy to examine a physical disk file when the Perl script is not behaving in an expected manner. The latter is easier on the eyes and conforms to the general principle of not committing disk resources unnecessarily. In addition, if portability to other platforms (Windows NT, for example) is an issue, the Perl sort command is preferable; it's best to avoid operating system-specific functions whenever possible.

A Double Binary Search with a Little State Thrown In

Time to get more involved. The preceding example demonstrated a user asking for a particular company and a binary search retrieving all hits from a company.sorted index file.

Now imagine what I faced when Len Zacks (the president of Zacks Investment Research) asked me, "Suppose my users know the ticker symbol in advance but not necessarily the company. How can we interface to your filings retrieval system?" (See note)

"Can you get me a sorted file of tickers and company names?" I slyly retorted, stalling for time.

"Yes," was the prompt answer. Now I had the problem of actually building something.

What is required for the "Known Ticker" application?

A reasonable way to pass the preordained ticker on the URL command line.A binary search on the ticker file to locate the corresponding company name. The ticker file contains approximately 8,230 companies.A second binary search to retrieve the filings after the company name is in hand (in fact, identical to the goals of formlynx.pl.b).

The trick to solving the first problem is to provide the Zacks programmers with a Perl script URL rather than an HTML URL. Why? Because the Perl script, when accessed as a URL, can

Accept the ticker as a QUERY_STRING argumentOutput valid HTML with no functional difference from a regular HTML document

I created http://edgar.stern.nyu.edu/mgbin/zacksnew.pl and instructed Susan Krivulis, my Zacks liaison, to feed the ticker as a QUERY_STRING to this program. A Zacks user who is interested in Exxon (ticker symbol XON), for example, would be provided a URL to link to http://edgar.stern.nyu.edu/mgbin/zacksnew.pl?XON.

Listing 21.12 shows the code for zacksnew.pl.

Listing 21.12. The code for zacksnew.pl.

#!/usr/local/bin/perl # # Mark Ginsburg 3/95 # # zacksnew.pl # # user has pre-selected ticker and it is passed to us on the # command line QUERY_STRING # # Program creates reciprocal links dynamically back to the # Zacks Investor's Window. # $tick = $ENV{'QUERY_STRING'}; require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl'; require 'edgarlib'; &html_header("Zacks Ticker Results"); # from cgi-lib.pl $ticklen = length($tick); if ($ticklen == 0) { print "cannot continue, no ticker supplied !"; &home; # show standard NYU go-home stuff exit 0; } $tick=~ y/[a-z]/[A-Z]/; # get Ticker in all caps, to prepare for search. $th = $tick."_cvr.html"; # initialize reciprocal link back to Zacks site. print <<EndOfForm; # dump raw HTML to the user - a Form. <TITLE>EDGAR Automated Zacks Filing Retrieval by Ticker $tick </TITLE> <H2>EDGAR Zacks Filing Retrieval by Ticker $tick </h2> <ol> <li>Date Limit: <i> Optional </i> Default is since 1/1/94. <br> <li>Filing Type: <i> Optional </i> Default is "All". <br> <li>Filing Type Override: <i> Optional. </i> If you don't see your filing type on the picklist use this override field. <li>Hit Limit: <i> Optional </i> Default is the first 100 hits. <li>Debugger: <i> Optional </i> Set this "on" if you are having trouble with your browser. </ol> <FORM METHOD="POST" ACTION= "http://edgar.stern.nyu.edu/mgbin/zack_tick.pl?$tick"> Submit choices: <INPUT TYPE="submit" VALUE="Retrieve Filings"> Reset form: <INPUT TYPE="reset" VALUE="Reset">. <p> <i>Date limit:</i> <SELECT NAME="date-range"> <OPTION>Last Week <OPTION>Last Two Weeks <OPTION>Last Month <OPTION>Last Six Months <OPTION>Last Year <OPTION SELECTED>No Limit </SELECT> <p> <a href="http://edgar.stern.nyu.edu/docs/general.html"> <i>Filing Type</i></a> <SELECT NAME="Form-Pick-List"> <OPTION SELECTED> ALL <OPTION> S-3 (Stock or Bond Registration) <OPTION> S-8 (ESOP) <OPTION> 8-12B <OPTION> 8-A12G <OPTION> 8-B12B <OPTION> 8-K (Current Event) <OPTION> 10-12B <OPTION> 10-C <OPTION> 10-K (Annual Report) <OPTION> 10-Q (Quarterly Report) <OPTION> 11-K <OPTION> 13F-E (Mutual Funds Holdings) <OPTION> 14 <OPTION> SC (SC 14D is >= 5% Acq) <OPTION> DEF (DEF 14A is the proxy) <OPTION> 424 <OPTION> 485 (Mutual Fund Prospectuses) <OPTION> NSAR (Semi-Annual Fund Reports) </SELECT> <P> <i>Filing Type Override</i> <INPUT NAME="filing-override"> <p> <b>Limit </b> your search to a maximum? <SELECT NAME="limit-List"> <OPTION> no limit <OPTION> 1000 <OPTION> 500 <OPTION> 250 <OPTION SELECTED> 100 <OPTION> 50 <OPTION> 25 <OPTION> 10 <OPTION> 5 <OPTION> 1 </SELECT> <P> Turn <b> debugging </b> on? <SELECT NAME="debug-option"> <OPTION SELECTED> no debug <OPTION> debug </SELECT> <P> </FORM> <HR> <A HREF="http://iw.zacks.com/firm/$th"> <img src="http://edgar.stern.nyu.edu/icons/z_bck.gif"> Zacks</A> Home page for Ticker $tick <p> <A HREF="http://aw.zacks.com"> <img src="http://edgar.stern.nyu.edu/icons/aw_bck.gif"> Zacks Analyst Watch Home Page</A> <p> <A HREF="http://iw.zacks.com/zir.html"> <img src="http://edgar.stern.nyu.edu/icons/iw_bck.gif"> Zacks Investor Window Home Page</A> <p> <a href="http://edgar.stern.nyu.edu/docs/general.html"> <img src="http://edgar.stern.nyu.edu/art/t_scroll.gif"> More information on Filing Types</a>. <A HREF="http://edgar.stern.nyu.edu/EDGAR.html"> <p> <img src="http://edgar.stern.nyu.edu/icons/torch.gif"> NYU EDGAR Project Home Page.</A> <A HREF="http://edgar.stern.nyu.edu/comment-form.html"> EndOfForm exit 0;

Comments about `zacksnew.pl`

Note the trickery with the variable $th. It is set to $tick."_cvr.html"-for example, XON_cvr.html in the case of Exxon. Why? To link back to the Zacks site as it specified. It promises the validity of the link <A HREF="http://iw.zacks.com/firm/$th"> to represent its company-specific information for any given ticker.

In fact, the link zacksnew.pl?AN (Amoco's ticker symbol is AN) shows a screen similar to that shown in Figure 21.3 but not identical. Take a look at Figure 21.7.

Figure 21.7 : The user presupplies the Amoco ticker symbol AN.

The user had the ticker symbol AN (Amoco) preselected and, after accessing the URL http://edgar.stern.nyu.edu/mgbin/zacksnew.pl?AN, is placed immediately into the filing retrieval form. Then, the parameters are input to execute a company search (Employee Stock Option Plan, ESOP, which is filing type S-8; last six-month search only) and the hotlinks to the Internet Multicasting Service (IMS) archive are assembled in the program that has to do the real work of two binary searches: zack_tick.pl.

Note that zack_tick.pl also is called with the same ticker in the QUERY_STRING. It is sufficiently armed (with the user filing parameters and the ticker) to return the appropriate filings.

A partial screen snapshot of the results of zack_tick.pl is shown in Figure 21.8. Note the custom links back to Zacks resources constructed by the zack_tick.pl program. A good advertisement for binary searches is the fact that this query takes only a few seconds on a Sun Sparc 5.

Figure 21.8 : The user, who started the application with ticker symbol AN (Amoco), finishers with the hotlinked SEC S-8 filing for Amoco.

Listing 21.13 shows the code for zack_tick.pl. Compare it to the code formlynx.pl.b that I presented earlier. Notice the differences:

I need to perform two binary searches now instead of one.

Not all input is from the form. The ticker is in the QUERY_STRING. (An important condition of this interface was that the users never have to type in the ticker; it was understood that if they were located on a certain page (for example, the Amoco page on the Zacks side), the Zacks program would pass Amoco's ticker, AN, to my CGI script zacksnew.pl.)

Listing 21.13. The code for zack_tick.pl.

#!/usr/local/bin/perl # # zack_tick.pl # # ticker comes from the query_string # calling program: zacknew.pl # ############################### sub match{ require 'deweb.pl'; require 'edgarlib'; require 'edgardate'; require 'look.pl'; # Required for binary search require 'ctime.pl'; $Date=&ctime(time); # human-readable date and time from the timestamp. $sdate=chop($Date); # take off the *ludicrous* hard return require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl'; &html_header("Zacks EDGAR Filing Results"); # from cgi-lib.pl &form_init; # initialize the assoc. array of form_desc $julnum = &j_number(); # Get the input read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'}); $tick = $ENV{'QUERY_STRING'}; # pick up the Ticker from the URL # Split the name-value pairs @pairs = split(/&/, $buffer); # Create an associative array %form foreach (@pairs) { ($key,$value) = split(/=/,$_); $value = &deweb($value); # clean up the value. $form{$key} = $value; # Check for filing-type override. if ($form{'filing-override'} eq "") { $uform = $form{'Form-Pick-List'}; @newform = split(/\s/,$uform); $uform = $newform[0]; }else{ $uform = $form{'filing-override'}; } $infile = "/usr/users/mark/zacks/ticks.sorted"; # the Ticker database. open(INFILE,$infile) || die "cannot open $infile"; &look(*INFILE, $tick,1,1); # search on ticker supplied in QUERY_STRING $tickctr = 0; while (<INFILE>) { last unless /^$tick\b/i; $tickctr++; @line = split(/:/); $cn = $line[1]; last; # get the heck out if found. } if ($tickctr < 1) { print "Ticker $tick not found in file; cannot continue."; print "<A HREF=\"http://edgar.stern.nyu.edu/mgtest/zc.html\">"; print "Try the Company Search </A>"; &home; exit 1; } print "Company Name on file as: <b>$cn</b>"; @co = split(/\s/,$cn); if ($#co > 1) { # If company only one word, use it, else use first two. $combo = $co[1]." ".$co[2]; } else { $combo = $co[1]; } print "and the search we will run will be on <i>$combo</i><p>"; if($form{'date-range'} eq "Last Week"){ $matchdate=&main'jtod($julnum-7); }elsif($form{'date-range'} eq "Last Two Weeks"){ $matchdate=&main'jtod($julnum-15); }elsif($form{'date-range'} eq "Last Month"){ $matchdate=&main'jtod($julnum-30); }elsif($form{'date-range'} eq "Last Six Months"){ $matchdate=&main'jtod($julnum-180); }elsif($form{'date-range'} eq "Last Year"){ $matchdate=&main'jtod($julnum-360); } else { $matchdate=&main'jtod(2449354); } ($nday, $nmon, $nyear)=split(/-/,$matchdate); if($nmon < 10){ $nmon=join('','0',$nmon); } if($nday< 10){ $nday = join('','0',$nday); } $matchdate=join('',$nyear,$nmon,$nday); $forms = "/usr/local/edgar/web/docs/company.sorted"; $logpath="/web/profile/logs/"; $logname="zacks.log"; $logfile = ">>$logpath$logname"; &aux_vars; # for max_hit counter, and debugging info on/off. &field_head(); # for the Results columns. print"<HR>"; $hitctr = 0; # hit counter variable # print "searching for $comp[1] and form $newform[0] and date $matchdate \n"; open(COMPANY, $forms) || die "cannot open the INPUT FILE"; &look(*COMPANY, $combo,1,1); # get the lookup from the ticker resolver. open(LOGFILE, $logfile) || die "problem opening Log file"; print LOGFILE "$Date | $ENV{REMOTE_HOST} | $ENV{REMOTE_ADDR} | $ENV{HTTP_USER_AGENT} | company: $co[1] | ticker: $tick | form $uform | hit-limit: $hitctr debug flag: $form{'debug-option'} date: $matchdate \n"; close LOGFILE || die "problem closing logfile\n"; while (<COMPANY>){ last unless /^$combo.*/i; # get the co. name from ticker resolver. @line = split(/\s\s+/); if (($line[1] =~ /$uform/i || $uform eq "ALL")&&($line[3] >= $matchdate)) { $company = "<A HREF=ftp://town.hall.org/$line[4]>$line[0]</A>"; @date = split(//,$line[3]); $date = "$date[4]$date[5]-$date[6]$date[7]- $date[0]$date[1]$date[2]$date[3]"; $hitctr++; if ($hitctr > $form{'limit-List'}) { $hitctr-; print "User limit of $hitctr reached; exiting."; last; } print "<pre>"; # # Pick up the form description. (this is a global subroutine using the # form description associative array) # $form_desc = &form_desc($line[1]); printf(" %s %-10s %-25s %-25s",$date,$line[1], $form_desc,$company); print"</pre>"; } # end of IF } # end of WHILE close(COMPANY) || die "cannot close the INPUT FILE"; # $tick =~ y/[a-z]/[A-Z]/; $th = $tick."_cvr.html"; print "<hr>"; $hnew = $hitctr; print "Your search had <b>$hnew</b> hit(s).<p>"; print "<A HREF=\"http://iw.zacks.com/firm/$th\">"; print "<img src=\"http://edgar.stern.nyu.edu/icons/z_bck.gif\">"; print" Zacks</A> $combo Home"; print "<p>"; print "<A HREF=\"http://aw.zacks.com/\">"; print "<img src=\"http://edgar.stern.nyu.edu/icons/aw_bck.gif\">"; print "Zacks Analyst Watch Home Page</A> "; print "<p>"; print "<A HREF=\"http://iw.zacks.com/zir.html\">"; print "<img src=\"http://edgar.stern.nyu.edu/icons/iw_bck.gif\">"; print "Zacks Investor Window Home Page</A> "; &home("torch.gif","EDGAR NYU Development Site"); # &nyu_trailer; # exit 1; } #end of "eval &match" sub field_head{ $fhdr="<B>Form</B>"; $chdr="<B> Company Name</B>"; $shdr="<B>Form Description </B>"; $dhdr="<B>Date Filed</B>"; print "<pre>"; printf(" %-10s %-10s %-22s %-30s",$dhdr, $fhdr, $shdr, $chdr); print "</pre>"; } # # sub j_number{ local($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime; $smon=$mon+1; $syear=$year+1900; $timedate=join('-',$mday,$smon,$syear); &main'dtoj($timedate); } sub debug { print "<pre>"; print "Debugging Information \n"; print "*----------------------------------* \n"; print "User Limit is $form{'limit-List'} \n"; foreach (@pairs) { print "$_ \n"; } print "*----------------------------------* \n"; print "</pre>"; } sub aux_vars { if ($form{'limit-List'} =~ /^no/i) { $form{'limit-List'} = 99999; } # no limit if ($form{'debug-option'}=~ /^debug/i) { &debug; } } eval '&match'; exit 0;

Code Discussion: `zack_tick.pl`

The code's a little lengthy, but not too bad to read. The ticker that came from the QUERY_STRING is looked up in the Zacks ticker database. If the ticker is not found, the program exits immediately with a formatted HTML message. If it is found, the company name is extracted. Then, the same logic as formlynx.pl.b is applied to locate the matching filings on the 21MB index file.

Simpler, but still a useful trick, is this line:

&home("torch.gif","EDGAR NYU Development Site");

The argument "torch.gif" causes the daring purple NYU torch image to appear on the answer rather than our standard and rather boring "back.gif." Similarly, the string "EDGAR NYU Development Site" overrides the default caption.

Final Binary Search Example: The Partial Company-to-Ticker-to-Filing Application

Consider the problem of a user who does not know the ticker symbol offhand, yet would recognize the ticker if it were presented in a picklist.

Assume that the user knows the first few letters of the company. This was the puzzle posed to EDGAR intern Genya Kosoy, and he solved it as shown in these steps:

Present a form virtually identical to Figure 21.1, where the user enters a few starting letters of the company and the search parameters (date range, filing type, and so on). This is on-line at http://edgar.stern.nyu.edu/a1.html.
Preprocess the Zacks ticker database so that it is sorted by company rather than ticker. Look up the user's company name substring and handle these possibilities:
If there is more than one possible completion for the inputted company, dynamically form a picklist of matching companies and their corresponding ticker symbols and CUSIPS (a Standard and Poor's numbering scheme for financial instruments, including stocks, bonds, and options).
If there is only one possible completion, use that company name and proceed immediately to search the filings index.
If there are no completions, note that and return to the starting screen.
After the company is selected from the picklist (or a unique company is determined), the familiar binary search on the filings index occurs.

This scheme presents another problem: how to carry the user's search constraints, entered at the very beginning of the application, forward until it is time to search the filings index. Genya solved this state problem by using hidden variables, which are a special class of CGI form variables. The browser cannot see them but communicates them to the CGI program along with the visible form variables. In this way, hidden variables can be used to maintain state in a complex application.

Company-to-Ticker-to-Filing Application Walkthrough

The application starts with the user entering a few letters of the company name-for example, dig-and leaving the other form parameters at their default values, as shown in Figure 21.9.

Figure 21.9 : The user knows only that the company starts with DIG.

The HTML form a1.html (the complete source can be found in the "Additional Code" section at the end of the chapter) then uses a METHOD=POST to invoke zack3.pl, the program that constructs the picklist of all possible completions of the substring DIG.

Figure 21.10 shows the picklist, consisting of 10 possible completions, sorted in company order; the fields from left to right are company name : ticker symbol : CUSIP.

Figure 21.10 : The user sees 10 completions of DIG and proceeds to select DIGICON.

The user sees that Digicon's ticker symbol is DGC and remembers that it is the correct one. After submitting Digicon, the filings are retrieved as shown in Figure 21.11.

Figure 21.11 : Digicon's filings are retrieved with the original constraints of Figure 21.9 in effect.

The most interesting aspect of the application is the handling by the program zack3.pl of the user's company entry. It turns out that the construction of the dynamic picklist is accomplished by a subroutine within zack3.pl; the script outputs an HTML form consisting of this picklist. After the user picks one company from the list, the form elegantly does a METHOD=POST to itself. Upon reentry, it is determined that a unique company exists for the user's picklist selection, and the filings are retrieved.

If the user inputted a company that has only one unique completion, the program dispenses with the now superfluous picklist and immediately does the binary search on the filings index.

Listing 21.14 shows the code for zack3.pl.

Listing 21.14. The code for zack3.pl.

#!/usr/local/bin/perl # # zack3.pl # # user inputs Company substring in ~genya/a1.html: # # This program forms a picklist if > 1 completion; # immediately does index lookup if = 1 completion, # and outputs warning if 0 completions. # # Mods. # ---- # 4/20/95: move submit button to top of form, per Zacks MG # # sub match{ require 'deweb.pl'; require 'edgarlib'; require 'edgardate'; require 'look.pl'; # Required for binary search require 'ctime.pl'; $Date=&ctime(time); # human-readable date and time from the timestamp. $sdate=chop($Date); require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl'; &html_header("Zacks Ticker Results"); # from cgi-lib.pl &form_init; # initialize the assoc. array of form_desc $julnum = &j_number(); # Get the input read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'}); # Split the name-value pairs @pairs = split(/&/, $buffer); foreach (@pairs) { ($key,$value) = split(/=/,$_); $value = &deweb($value); unless ($key eq "company") { ($value, @garbage) = split(/\s/,$value);} $form{$key} = $value; } #overwrite the selection of filing type selected by user if ($form{'filing-override'} eq "") { $uform = $form{'Form-Pick-List'}; }else{ $uform = $form{'filing-override'}; } @compan=split(/:/,$form{'company'}); $form{'company'}=@compan[0]; # # $infile is Zacks's ticker database, sorted by company # $infile = "/web/xref/databases/zacks_company_sorted.dat"; open(INFILE,$infile) || die "cannot open $infile"; &look(*INFILE, $form{'company'},0,1); $ctr = 0; %choice=(); # choice: the completion array while (<INFILE>) { last unless (/\b$form{'company'}.*/i); $ctr++; if (/^$form{'company'}\b.\s+:/i) { @line= split(/:/); $cn = $line[1]; last; } $choice{$ctr}=$_; } if ($ctr > 1) { &new_form1; # subroutine to handle forming picklist exit 1; # quit after picklist formed - it's a new form } sub new_form1 { print ("<h2>Company name $form{company} is not unique!</h2><hr> <br>\n"); print ("Pick your desired company from the list below: \n"); $option="<OPTION>"; print "<FORM METHOD =\"POST\" ACTION= \"http://edgar.stern.nyu.edu/gkbin/zack3.pl\"><br>"; print "<INPUT TYPE =\"submit\" Value= \"Submit\"><p>\n"; print "<SELECT NAME=\"company\" SIZE =15>"; for ($l=1; $l<=$ctr; $l++) # loop to form the picklist entries { @ln=split(/:/,$choice{$l}); print "<option> $ln[0] : $ln[1] : $ln[2]"; } print "<\select><br>"; # # Now make sure to pass the original user filing-search constraints # as hidden variables. # print "<INPUT TYPE=\"hidden\" NAME=\"date-range\" VALUE=\ "$form{'date-range'}\">\n"; print "<INPUT TYPE=\"hidden\" NAME=\"limit-List\" VALUE=\ "$form{'limit-List'}\">\n"; print "<INPUT TYPE=\"hidden\" NAME=\"filing-override\" VALUE=\ "$form{'filing-override'}\">\n"; print "<INPUT TYPE=\"hidden\" NAME=\"Form-Pick-List\" VALUE=\ "$form{'Form-Pick-List'}\">\n"; print "<INPUT TYPE=\"hidden\" NAME=\"debug-option\" VALUE=\ "$form{'debug-option'}\">\n"; print "<br>\n"; print "<br><hr>"; print "Or you can <A HREF=\"http:/edgar.stern.nyu.edu/~genya/a.html\">"; print "Try again</A><br>\n"; &home; # standard NYU subroutine to present go-home link <\FORM> } if ($ctr < 1) { print "Ticker $comp[1] not found in file; cannot continue."; print "<A HREF=\"http://edgar.stern.nyu.edu/~genya/a1.html\">"; print "Try Again! </A>"; &home; exit 1; } if ($form{'company'} eq "") { print "<h1>Error! It seems you did not enter a company.</h1><p>"; print "<A HREF=\"http://edgar.stern.nyu.edu/~genya/a1.html\">"; print "Try Again! </A>"; print "<hr>"; print "However, if you did enter a company, "; print "retry this program with the debugging flag set on "; print "and report your results to us. Some users have reported "; print "troubles with browsers such as Mac Netscape, Netmanage, and "; print "Air Mosaic (Spry). We are anxious to resolve any and all "; print "problems!"; print "<hr>"; print "<a href=\"http://edgar.stern.nyu.edu/\">"; print "<img src=\"http://edgar.stern.nyu.edu/icons/back.gif\">"; print "Return to the Edgar Home Page </A>"; exit 1; # # Once we get here, we're golden, because the company name has been # determined to have a unique completion. Proceed with the binary # search of the Filings Index, using the hidden variables as # user preferences. # print "Company Name on file as: <b>$form{'company'}</b>"; print "and the search we will run will be on <i>$form{'company'}</i><p>"; if($form{'date-range'} eq "Last Week"){ $matchdate=&main'jtod($julnum-7); }elsif($form{'date-range'} eq "Last Two Weeks"){ $matchdate=&main'jtod($julnum-15); }elsif($form{'date-range'} eq "Last Month"){ $matchdate=&main'jtod($julnum-30); }elsif($form{'date-range'} eq "Last Six Months"){ $matchdate=&main'jtod($julnum-180); }elsif($form{'date-range'} eq "Last Year"){ $matchdate=&main'jtod($julnum-360); } else { $matchdate=&main'jtod(2449354); } ($nday, $nmon, $nyear)=split(/-/,$matchdate); if($nmon < 10){ $nmon=join('','0',$nmon); } if($nday< 10){ $nday = join('','0',$nday); } $matchdate=join('',$nyear,$nmon,$nday); $forms = "/usr/local/edgar/web/docs/company.sorted"; $logpath="/web/profile/logs/"; $logname="zacks.log"; $logfile = ">>$logpath$logname"; &aux_vars; # for max_hit counter, and debugging info on/off. &field_head(); # for the Results columns. print"<HR>"; $hitctr = 0; # hit counter variable # # Search on the first * two * words of the company's full name, # to improve accuracy on the filings search. # @comb=split(/\s/, $combo); $combo=$comb[0]." ".$comb[1]; open(COMPANY, $forms) || die "cannot open the INPUT FILE"; &look(*COMPANY, $combo,0,1); # # Keep Logging Information. # open(LOGFILE, $logfile) || die "problem opening Log file"; print LOGFILE "$Date | $ENV{REMOTE_HOST} | $ENV{REMOTE_ADDR} | $ENV{HTTP_USER_AGENT} | ticker: $comp[1] | $newform[0] | hit-limit: $ulimit[1] debug flag: $udebug[1] date: $matchdate \n"; close LOGFILE || die "problem closing logfile\n"; while (<COMPANY>){ last unless /^$combo.*/i; # get the co. name from ticker resolver. @line = split(/\s\s+/); if (($line[1] =~ /$uform/i || $uform eq "ALL")&&($line[3] >= $matchdate)) { $company = "<A HREF=ftp://town.hall.org/$line[4]>$line[0]</A>"; @date = split(//,$line[3]); $date = "$date[4]$date[5]-$date[6]$date[7]- $date[0]$date[1]$date[2]$date[3]"; $hitctr++; if ($hitctr > $form{'limit-List'}) { $hitctr-; # adjust $hitctr to be correct. print "User limit of $hitctr reached; exiting."; last; } print "<pre>"; # # Pick up the form description. # $form_desc = $forms{"$line[1]"}; printf(" %s %-8s %-22s %-25s",$date,$line[1], $form_desc,$company); print"</pre>"; } # end of IF } # end of WHILE close(COMPANY) || die "cannot close the INPUT FILE"; print "<hr>"; print "Your search had <b>$hitctr</b> hit(s).<p>"; print "<A HREF=\"http://edgar.stern.nyu.edu/~genya/a1.html\"> Next</A> Company Search"; &nyu_trailer; # global subroutine exit 1; } #end of &match" sub field_head{ $fhdr="<B>Form</B>"; $chdr="<B> Company Name</B>"; $shdr="<B>Form Description </B>"; $dhdr="<B>Date Filed</B>"; print "<pre>"; printf(" %-10s %-10s %-22s %-30s",$dhdr, $fhdr, $shdr, $chdr); print "</pre>"; } # # sub j_number{ local($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime; $smon=$mon+1; $syear=$year+1900; $timedate=join('-',$mday,$smon,$syear); &main'dtoj($timedate); } sub debug { print "<pre>"; print "Debugging Information \n"; print "*----------------------------------* \n"; print "User Limit is $form{'limit-List'} \n"; foreach (@pairs) { print "$_ \n"; } print "*----------------------------------* \n"; print "</pre>"; } sub aux_vars { if ($form{'limit-List'} =~ /^no/i) { $form{'limit-List'} = 99999; } # no limit means No Limit. if ($form{'debug-option'}=~ /^debug/i) { &debug; } } eval '&match'; exit 0;

Code Observations: `zack3.pl`

This is a good example of maintaining state; it makes sense that the user specifies filing search constraints up front with the company name fragment. Having done that, the constraints are propagated to the final script with no further user action required. The hidden variables were quite straightforward to use, and I would recommend this method over QUERY_STRING and PATH_INFO on the command line to avoid shell overloading and cosmetically unattractive URLs.

In Chapter 24, "Scripting for the Unknown: The Control of Chaos," Eric Tall revisits the question of maintaining state in a more chaotic problem domain.

Perl Version 5

Perl Version 5 represents an advance on many fronts compared to the standard Perl Version 4.03l6. As the reader might expect, numerous Internet sites carry on-line manual pages and more extensive syntax explanations and links to related Perl resources. (See note)

One of Perl 5's strengths is its new object-oriented approach. C++ fans will appreciate Perl 5 treating file handles as objects and allowing, in a straightforward manner, classes and inheritance. Another important achievement is the programmer's capability to embed Perl 5 code in a C or C++ application, and the reverse is handled too: a preprocessor is provided to make Perl 5 aware of custom C or C++ routines.

Perl 5 is much stronger in its database methods implementation. The Perl 4 dbmopen interface has been rewritten to support object-oriented methods-specifically, to allow the programmer to "tie" variables to an object class. The class, then, carries with it a list of access methods permitted. Thus, programmers no longer are stuck with the DBM and NDBM packages to implement database functionality.

Developers with time constraints surely will want to build on an existing relational database engine, however, instead of coding a new object class from scratch. This leads to the next section, which confronts the challenge of building gateway programs to talk to vendor-supplied databases.

Perl and Relational Database Packages

One of the more intriguing areas of software development on the Web is integrating industrial-strength relational database management systems (RDBMS) with Perl scripts. Naturally, big database vendors such as Oracle and Sybase have many global customers, so there is much activity to build transparent toolkits to pass queries into the database engines and cosmetically process the answers coming back. Thankfully, the Structured Query Language (SQL) enables users to communicate with database engines on the command line.(See note) SQL is the standard language to construct queries on tabular data; the rows are the records of the table, and the columns are the fields.

All of the general Web principles of integration design apply when working with RDBMS packages:

Know your application. The developer should have significant experience with the RDBMS package on the command line. This includes passing queries to the database (many packages have idiosyncratic means of inputting SQL to the database server) and receiving and redirecting output from the database.

Be aware of alternatives. The developer should ask, "Do I really need the overhead of an RDBMS to represent my data store on the Web? What are my options?" (I will return to this point.)

Be willing and eager to experiment. Web applications are very performance sensitive. If a database query passed via a Perl gateway is very slow, where are the weaknesses? A poor gateway? Network congestion? A poorly designed underlying database structure? Tinkering with various aspects of the system is a must, as is empirical data collection to help resolve performance woes.

There is one more layer of complexity now: the need for a good database design before the Web integration effort commences. If a database is designed poorly, with badly chosen key fields or crucial indexes left unbuilt, the Web integrator will not be able to succeed. Fast performance times on queries presupposes good database design. Restating this, if the web developer notices that the database interface is handling user input gruesomely slowly, the first step should be to review the underlying database design. Do the indexes make sense stacked up against the most common user queries? Are the tables properly normalized?

Tip

The web developer should not approach an RDBMS integration effort without reviewing the underlying database design. It doesn't hurt, either, to become acquainted with the site database administrators (DBAs).

On-line resources are available to help developers who work with the more common packages-for example, Oracle and Sybase. The GSQL toolkit, provided by the NCSA, is one example of a C programming library with specialized "hooks" to talk to Sybase or Oracle.(See note)

The amount of effort necessary to customize an Internet toolkit, such as GSQL, at a given Web site depends on two factors:

How close is the Web site's database engine to the engine assumed by the toolkit?
Are there idiosyncratic system architecture features at the Web site?

If preliminary analysis suggests that the generic toolkit is not getting the job done, I recommend a step back to overall system design. Usually, the task is clear (accomplishing an interactive gateway to a database with reasonable response time to afford the end user's ad-hoc query capability). I prefer a quick prototyping in this case; put working Perl screens out there for selected users and use their feedback to iterate the screen design and upgrade the query capability.

Tip

One good way to handle SQL upgrades is to create a generic template file. Use an HTML form to capture user input, and then use a Perl gateway program to substitute selected variables in the template SQL with the user's input. The SQL then is complete and ready to pipe into the database engine. The only work left to do is to handle the query answer-a purely cosmetic task.

Illustra: An Unusual RDBMS

The NYU EDGAR project uses Illustra, an object-relational database. It supports ANSI SQL and has significant object-oriented extensions such as user-definable new data types, table inheritance, and more.(See note)

Illustra's API is not particularly well developed, which makes it an appealing target to hack together a simple Web interface in relative peace (not having my work anticipated by 200 other parallel projects is sometimes a good thing).

Figure 21.12 shows a simple interface to a SQL-compliant database-Illustra, in this case.

Figure 21.12 : The user enters parameters that will be assembled into a valid SQL query.

The user can pick the table, select one or more fields to query, select one field to constrain, type in a value to which that field must be equal, and, finally, constrain the date range. In this example, the user is looking at 10-K annual reports for the last 90 days.

The gateway script must assemble these form components into valid SQL.

Listing 21.15 shows the sql_wrapper.pl code.

Listing 21.15. CGI interfacing to SQL: The sql_wrapper.pl code.

#!/usr/local/bin/perl # # sql_wrapper.pl # interface with sqlform.html, builds SQL, # calls a C program...processes and displays output. # 3/18/95 mg. # require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl'; # use html_header require 'edgarlib'; # use the nyu_trailer # the sql is built in sqlfile. # then the sql runs in the "C" program exec_sql. # then the output is dumped to mosaic screen with MIME header. $ipath = '/usr/users/andrey/illus/documentation'; $outfile = "$ipath/temp2.out"; $sqlfile = "$ipath/sql.in"; $sqlout = "$ipath/sql.out"; open (SQLFILE, ">$sqlfile") || die "cannot open SQL output file"; if (-e $outfile) { 'rm $outfile'; } if (-e $sqlout) { 'rm $sqlout'; } # Get the input read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'}); # Split the name-value pairs @pairs = split(/&/, $buffer); &html_header("SQL Output"); ($garbage,$table) = split(/=/,$pairs[0]); $form_t=0; while(($form_t<=$#pairs)&&($_=$pairs[$form_t])){ if ($pairs[$form_t] =~ /^constrain/){ # constraint means end of rpt. last; } $form_t++; } $form_adj = $form_t - 1; # now we know how many report fields. foreach $i (1 .. $form_adj) { ($garbage,$valid) = split (/=/,$pairs[$i]); $rfield = $rfield.", ".$valid; # build the list of rpt. fields. # but this gives us leading "," - bad. } substr($rfield,0,1) = " "; # get rid of leading comma. Then all OK. ($garbage,$cfield) = split(/=/,$pairs[$form_t]); # pick up constraint fld. ($garbage,$value) = split(/=/,$pairs[$form_t+1]); # and pick up value. ($garbage,$drange) = split(/=/,$pairs[$form_t+2]); # and pick up date range $sql_string = "select $rfield from $table where $cfield = \'$value\' and current_date - date_filed \< interval \'$drange\' day; \n"; print SQLFILE "$sql_string"; # unfortunately the < character causes problems when outputted to Mosaic # screen. Make it &lt meta-character $newsql = "select $rfield from $table where $cfield = \'$value\' and current_date - date_filed &lt interval \'$drange\' day; "; print "<hr>"; print "Your SQL Query is "; print "<pre> \n"; print "<b> $newsql <\/b> \n"; print "<\pre> "; print "<hr>"; close SQLFILE || die "cannot close SQL FILE"; # exit 0; # debug print "Running the SQL request...\n"; '$ipath/exec_sql edgar < $sqlfile > $sqlout '; # Andrey's C Program. print "Finished with SQL request...\n"; open (SQL_OUT, $sqlout) || die "cannot open the sql answer"; while (<SQL_OUT>) { @line = split(/\t/); print "<pre>\n"; chop($line[0]); # line0 has an annoying TAB (hex 09) $line[0]=~tr/ / /s; $newline = sprintf("%-40s ",$line[0]); $newline =~ s/\s\s+//; print "$line[1]\t<A HREF=ftp://town.hall.org/$line[2]>$newline</A>"; print "</pre>\n"; } &nyu_trailer; &illustra_trailer; exit 0;

Code Discussion: `sql_wrapper.pl`

This program uses a C program that was written to accept SQL input and query the EDGAR database. Look at this line of code:

'$ipath/exec_sql edgar < $sqlfile > $sqlout '; # Andrey's C Program.

The $sqlfile was built (after some contortions) to be syntactically correct SQL. In this case, $sqlfile is equal to the following:

Select company,date_filed from index_10k where sec_form = '10-K' and current_date - date_filed < interval '90' day;

The C program passes this SQL to the database server, and the answer is redirected to an output file. The output file is opened by the Perl program and processed for formatted HTML output. Note the tiny things I had to do throughout to clean up output. Along the way, I discovered that a tab code was fouling up the works (ASCII code 09), and had to chop it out. More seriously, when I echoed the SQL query to the screen, I ran into unexpected difficulties. After further investigation, I found the following apparently innocuous query:

Select company,date_filed from index_10k where sec_form = '10-K' and current_date - date_filed < interval '90' day

It's not a good idea to display this in an HTML-formatted document! The < character completely confuses and baffles the NCSA Mosaic for X client, and things come to a crashing halt. The solution, as you will notice from my brief comment, was to construct a replica of the query for display-with the correct HTML < substituted for <.

Figure 21.13 shows the result. The Illustra database engine returns these records in response to the SQL query, and the answer is captured and formatted by the Perl gateway. The database query results are hotlinked to the IMS filing database.

Figure 21.13 : The results of a Perl-Illustra gateway request.

Before I leave SQL and relational databases, I want to show a more complex SQL query and touch on the issue of tuning queries.

The NYU EDGAR project recently has started to analyze the equity holdings of mutual funds; their shifts over time pose graphical and analytical challenges. Forgive me if I call this fund by the alias Big Fund without hurting the instructional value.

Figure 21.14 shows a prototype query string to answer the question to the following scenario. "Between February 24, 1994, and May 16, 1994, Big Fund made a number of sell decisions on its large portfolio. Assuming unrealistically that Big Fund sold the stocks on February 24, which stock sale decisions were poor at a magnitude of $20 million or more?" Restating this, if Big Fund had not sold the stocks in question on February 24, it would have had on May 16, 1994, at least $20 million more per stock in its coffers. I apologize for the unrealistic assumption, but I offer the example anyway because it can serve as the basis for a useful fund tracker with more underlying data.

Figure 21.14 : The user chooses a magnitade of "loss" (in dollars) and submits the query to the SQL gateway.

The core of the underlying SQL query is interesting:

select old.name,old.cusip, old.shares, new.shares, ((old.shares - new.shares) * new.prc_per_sh) loss from BIGFUND old,BIGFUND new where old.cusip = new.cusip and old.file_date = 940224 and new.file_date = 940516 and old.shares > new.shares and old.prc_per_sh < new.prc_per_sh and

Note that this SQL query is incomplete; it contains a dangling and. I'm missing the final criterion-the one supplied by the user in Figure 21.11. The gateway completes the query simply by concatenating the missing piece. The correct query follows:

select old.name,old.cusip, old.shares, new.shares, ((old.shares - new.shares) * new.prc_per_sh) loss from BIGFUND old,BIGFUND new where old.cusip = new.cusip and old.file_date = 940224 and new.file_date = 940516 and old.shares > new.shares and old.prc_per_sh < new.prc_per_sh and loss < 20000000;

Then the SQL engine can go merrily on its way answering the query.

Figure 21.15 shows the results of the $20-million question.

Figure 21.15 : The SQL server returns the loss leaders, in a manner of speaking, for Big Fund.

Listing 21.16 shows the Perl gateway code sql_loss.pl.

Listing 21.16. The sql_loss.pl code.

#!/usr/local/bin/perl # # sql_loss.pl - interface with mutual_fund 'loss' program # # Note: shows how to join a core SQL query with an ad-hoc # user-supplied end-piece to form a complete, valid SQL query # # Mark Ginsburg 5/95 # require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl'; # use html_header require 'edgarlib'; # Build the SQL. Then pipe it into msql, and redirect output. # # 'msql' is the Illustra command line interface to the db server # $ipath = '/web/xref/mutual_funds/BIG_FUND/'; $outfile = "$ipath/temp2.out"; $sqlhdr = "$ipath/loss_hdr.sql"; # the core of the SQL query $sqlfill = "$ipath/fill.sql"; # the little piece supplied by user $sqlfull = "$ipath/full.sql"; # the complete SQL query input $sqlans = "$ipath/sql.answer"; # the results of the Query open (SQLFILE, ">$sqlans") || die "cannot open SQL output file"; if (-e $sqlfill) { 'rm $sqlfill'; } if (-e $sqlfull) { 'rm $sqlfull'; } read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'}); # Split the name-value pairs @pairs = split(/&/, $buffer); &html_header("Mutual Fund Loss Report"); ($garbage,$loss) = split(/=/,$pairs[0]); $loss = &deweb($loss); $loss_phrase = "loss > $loss;"; # build proper SQL end piece open (FILL,">$sqlfill") || die "cannot open fill file"; print FILL $loss_phrase; close (FILL); 'cat $sqlhdr $sqlfill > "$sqlfull"'; # now, the SQL query is complete $q = 'cat "$sqlfull"'; '/web/donnelley/miadmin/bin/msql edgar < "$sqlfull" > $ipath/output' ; open (SQLO,"$ipath/output") || die "cannot open SQL OUTPUT"; format STDOUT = @<<<<<<<<<<<<<<<<<<<<< $@<<<<<<<<<<<<<<<<< $line[1], $los . print "<pre> \n"; print "<b> COMPANY MAGNITUDE OF BAD DECISION </b> \n\n"; while (<SQLO>) { @line = split(/\|/,$_); if ($line[3] =~ /\d+/) { ($num,$exp) = split(/E/,$line[5]); $los = $num * (10**$exp); $los = int($los); write STDOUT; } # print "$line[1] $los \n"; } # # Disclaimer # print "\n\n"; print "<hr> \n"; print "'bad decision' is only a theoretical concept -\n"; print "how much more money the fund would have \n"; print "if they had sold these particular stocks on \n"; print "exactly February 24, 1994. \n\n"; print "The fund still could have made money on the sales \n"; print "because we do not know what day the stocks \n"; print "were actually sold, nor do we know the basis price \n"; print "of the purchases."; print "</pre>"; &nyu_trailer; &illustra_trailer; exit 0;

Observations about `sql_loss.pl`

Note that this program uses a different technique to interface with the SQL engine.

Instead of relying on a C program to "talk" to the database server, a command-line interface is used. Study this code line:

'/web/donnelley/miadmin/bin/msql edgar < "$sqlfull" > $ipath/output' ;

Here, the full SQL query represented in $sqlfull is piped into the msql command-line interface that Illustra offers. edgar is the name of the database that the server connects to, and $ipath/output is the file where the stdout is redirected-that is, the answer to the query. Most RDBMS packages, by the way, offer a similar interface to afford the web developer a convenient mechanism to redirect SQL to stdin and redirect the server's answer from stdout to a file.

Some postprocessing was a little quirky. Some end users might not be fans of the database's preferred scientific notation, for example. The server, for example, would report the number 23,456,789 as 2.3456789E8. The following code takes care of that:

($num,$exp) = split(/E/,$line[5]); # $line[5] is the exp. notation $los = $num * (10**$exp); $los = int($los); # $los looks OK now for output.

Also note my use of the format statement-just another technique to line up the output fields. As I stated previously, the <pre> and </pre> tags still are required to wrap around the formatted section.

The script relied on creating tiny files simply to help me debug faster.

Database Tuning

As it happened, the "loss query" ran exceptionally slowly on a Sun Sparc 5-in excess of 10 minutes per query. Going back to the Big Fund database design, I discovered that I had forgotten to build a B-tree index on the field file_date. This omission was an all-important one; when the index was built, the query (which, at its worst, is a Cartesian product of a 2,400-record table with itself) could make use of the fast B-tree index to increase performance. The query time decreased from more than 10 minutes to less than 30 seconds! My advice stands: A Web site should not allow the role of the DBA to be separate and distinct from the role of the Web RDBMS integrator.

Pros and Cons of Relational Databases on the Web

Relational databases are invaluable for storing large amounts of tabular data, relating the various tables by common key fields, and affording a simple standard query language, SQL, to query the databases. Consequently, Perl scripts can be written without too much effort to construct a SQL query and feed it to the RDBMS engine. The same script then can clean up the database's response for Web consumption. These products can take up significant amounts of machine resources, however. If a machine is a popular Web site, it has to handle many Internet accesses per day. If some of the accesses are kicking off relational queries, the machine easily can become overloaded. Dedicating a second machine as a database server is reasonable, but might stretch the budget. The web developer should consider carefully whether Perl database functionality (especially Perl 5) can be used in place of a third-party database engine. If a relational database is already in place at a site, it is the preferred route, but it then becomes imperative that the Web developer be intimately familiar with its behavior on the command line. Countless hours of debugging are saved if the Web developer knows essential database features such as the following:

Which environmental variables the RDBMS engine can use
SQL query-handling characteristics on the command line
Idiosyncratic behavior of the database server daemon
Maximum users (if any) on the database site license
General database tuning and performance issues

The question of performance is extremely difficult when interfacing relational databases on the Web due to the interplay between Net congestion, Perl gateway script behavior, and database engine response time. Empirical analysis is mandatory. If the relational database is simply too slow to function properly in real time, a faster Web server or the purchase of a dedicated second box for the database might be necessary. Vendors are bringing to market more robust database access products-for example, those using the Common Object Request Broker Architecture (CORBA) specification. A request broker can queue and prioritize requests and can help alleviate HTTP server load. In addition, look to these products to ease the task of tracing errors or handling other unexpected results from Web-based queries.

The need for the web developer to wear the hat of database guru in this environment is both a blessing and a curse. The addition of new skills is always a good thing, but there simply might not be enough time to properly understand the inner mysteries of the site's package.

Additional Code

Here I include some code fragments that I mentioned earlier in the chapter. I have supplemented the CD-ROM with an on-line code page at http://edgar.stern.nyu.edu/adv-soft/book.html.

look.pl This standard binary search routine comes standard with the Perl 4 or Perl 5 distribution; I repeat it here in Listing 21.17 because many readers asked me for its source.

Listing 21.17. The look.pl code.

;# Usage: &look(*FILEHANDLE,$key,$dict,$fold) ;# Sets file position in FILEHANDLE to be first line greater than or equal ;# (stringwise) to $key. Pass flags for dictionary order and case folding. sub look { local(*FH,$key,$dict,$fold) = @_; local($max,$min,$mid,$_); local($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) = stat(FH); $blksize = 8192 unless $blksize; $key =~ s/[^\w\s]//g if $dict; $key =~ y/A-Z/a-z/ if $fold; $max = int($size / $blksize); while ($max - $min > 1) { $mid = int(($max + $min) / 2); seek(FH,$mid * $blksize,0); $_ = <FH> if $mid;# probably a partial line $_ = <FH>; chop; s/[^\w\s]//g if $dict; y/A-Z/a-z/ if $fold; if ($_ lt $key) { $min = $mid; } else { $max = $mid; } } $min *= $blksize; seek(FH,$min,0); <FH> if $min; while (<FH>) { chop; s/[^\w\s]//g if $dict; y/A-Z/a-z/ if $fold; last if $_ ge $key; $min = tell(FH); } seek(FH,$min,0); $min; } 1;

cgi-lib.pl This venerable Perl 4 form parsing library has become largely obsolete by the arrival of its Perl 5 superset, CGI.pm, but it still is very useful, as Listing 21.18 shows.

Listing 21.18. The cgi-lib.pl code.

# # file: cgi-lib.pl # # desc: This library deals with basic CGI POST or GET method request # elements such as those delivered by an HTTPD form, i.e. a url # encoded line: a=b&b=c&c=d # Also handles <ISINDEX> GET requests. # Needs a 'require "cgi-lib.pl";' line in the main script # # # parse_request reads the POST or GET request from STDIN, and then splits # it into its name=value pairs. Special test for <ISINDEX> input. # sub parse_request { if ($ENV{'REQUEST_METHOD'} eq "POST") { # assumes read gets everything!! read(STDIN, $raw_query, $ENV{'CONTENT_LENGTH'}); } elsif ($ENV{'REQUEST_METHOD'} eq "GET" ) { $raw_query = $ENV{'QUERY_STRING'}; } else { # unrecognized request method return; } # Decode HEX values and spaces, if any if ($raw_query !~ /[&=]/) { # handle <ISINDEX> input $isindex = $raw_query; &decode_url($isindex); } else { %query = &decode_url(split(/[&=]/, $raw_query)); } } # #Decode a URL encoded string or array of strings #+ -> space #%xx -> character xx # sub decode_url { foreach (@_) { tr/+/ /; s/%(..)/pack("c",hex($1))/ge; } @_; } # # html_header sends an HTML header for the document to be returned # and the user can pass a parameter for use as the HTML Title sub html_header { local($title) = @_; print "Content-type: text/html\n\n"; print "<html><head>\n"; print "<title>$title</title>\n"; print "</head>\n<body>\n"; } # # html_trailer sends the HTML trailing material to STDOUT. # sub html_trailer { local($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime; local(@days) = ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday','Friday', 'Saturday'); local(@months) = ('January', 'February', 'March', 'April', 'May', 'June','July', 'August', 'September', 'October', 'November', 'December'); print "<p>\nGenerated by the NYU Stern School of Business"; print " on ", sprintf("%02d:%02d:%02d on %s %d, %d", $hour, $min, $sec, $months[$mon], $mday, 1900+$year); print "</body></html>\n"; } # keep require happy 1;

edgarlib The standard library of routines used by the NYU EDGAR development project; Listing 21.19 includes useful headers, trailers, and other goodies.

Listing 21.19. The edgarlib code.

#-------------------------# # Standard EDGAR Routines # #-------------------------# # # Supply the go-home and back.gif link. # sub home{ local ($gif,$text) = @_; # if nothing supplied, set default to back.gif and vanilla caption. if ($#_ < 0) { $gif = "back.gif"; $text = "Return to our Home Page"; } print "<HR>"; print "<a href=\"http://edgar.stern.nyu.edu/\">"; print "<img src=\"http://edgar.stern.nyu.edu/icons/$gif\">"; print "$text</A>"; print "<HR>"; } # sub nyu_trailer{ local($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime; local(@days) = ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'); local(@months) = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'); print "<center>"; print "<p><a href=\"http://www.stern.nyu.edu/\"><img src=\"http://edgar.stern.nyu.edu/icons/torch.gif\"></a> <p>"; print "</center>"; print "<p>\nGenerated by the <a href=\"http://is-2.stern.nyu.edu/\"> NYU Stern School of Business Information Systems Department</a>"; print " on ", sprintf("%02d:%02d:%02d on %s %d, %d", $hour, $min, $sec,$months[$mon], $mday, 1900+$year); print " with CGI scripts - annotated code may be found in "; print "<a href=\"http://www.rpi.edu/~decemj/works/wdg.html\"> this book </a>."; print "</body></html>\n"; print "<p> Note: if you cannot retrieve a filing, retry this routine and "; print "select the alternative data site. Our routines give the user a choice"; print "of which data site to use. Try both before registering a problem report."; } sub illustra_trailer{ print "This query made possible by a grant from \n"; print "<p>"; print "<a href=\"http://www.illustra.com\">\n"; print "<img src=\"http://edgar.stern.nyu.edu/icons/small_illustra.gif\">"; print " the Illustra Database people <\/a>"; } # # Len Zacks AW and IW stuff # sub zacks_trailer{ print"<A HREF=\"http://aw.zacks.com\">"; print"<p>"; print"<img src=\"http://edgar.stern.nyu.edu/icons/aw_bck.gif\">"; print"Return to Zacks Analyst Watch.</a>"; print"<p>"; print"<A HREF=\"http://iw.zacks.com/\">"; print"<p>"; print"<img src=\"http://edgar.stern.nyu.edu/icons/iw_bck.gif\">"; print"Return to Zacks Investors Window.</a>"; print"<p>"; } sub zacks_sponsors{ print"<A HREF=\"http://aw.zacks.com\">"; print"<img src=\"http://edgar.stern.nyu.edu/icons/aw_bck.gif\">"; print "Zacks Investment Research </a> furnished essential data for this system."; } sub sendmail { local($to,$subject,$body) = @_; open(PIPE, "|/bin/mail '$to'") || die "cannot open mail pipe"; print PIPE "Precedence: junk\n"; print PIPE "Subject: $subject\n\n"; print PIPE "$body"; close(PIPE) || die "cannot close mail pipe"; } 1; # it is CRITICAL to end a subroutine library with a 1;

deweb.pl This routine is in fact obsolete by the standard cgi-lib.pl or any other form-parsing package. However, it is instructive to see how the encoded variables are replaced by their clear text equivalents:

sub deweb{ local($name)=@_; $name=~s/\%20/ /g; $name=~s/\%2B/+/g; $name=~s/\%3A/:/g; $name=~s/\%26/&/g; $name=~s/\%2C/,/g; $name=~s/\%28/(/g; $name=~s/\%29/)/g; $name=~s/\%2F/\//g; $name; }

a1.html EDGAR filing retrieval by company name (see Listing 21.20).

Listing 21.20. EDGAR filing retrieval by company name: The a1.html code.

<TITLE>EDGAR Zacks Filing Retrieval by Company Name </TITLE> <H2>EDGAR Zacks Filing Retrieval by Company Name </h2> <ol> <li>Company Name: <b> Required. </b> Cannot exceed 5 characters. <br> <li>Date Limit: <i> Optional </i> Default is since 1/1/94. <br> <li>Filing Type: <i> Optional </i> Default is "All". <br> <li>Filing Type Override: <i> Optional. </i> If you don't see your filing type on the picklist use this override field. <li>Hit Limit: <i> Optional </i> Default is the first 100 hits. <li>Debugger: <i> Optional </i> Set this "on" if you are having trouble with your browser. </ol> <p> <FORM METHOD="POST" ACTION="http://edgar.stern.nyu.edu/gkbin/zack3.pl"> <b>Company Name:</b> <INPUT NAME="company" size=25 maxlength=25> <P> Submit choices: <INPUT TYPE="submit" VALUE="Retrieve Filings">. Reset form: <INPUT TYPE="reset" VALUE="Reset">. <p> <i>Date limit:</i> <SELECT NAME="date-range"> <OPTION>Last Week <OPTION>Last Two Weeks <OPTION>Last Month <OPTION>Last Six Months <OPTION>Last Year <OPTION SELECTED>No Limit </SELECT> <p> <a href="http://edgar.stern.nyu.edu/docs/general.html"> <i>Filing Type</i></a> <SELECT NAME="Form-Pick-List"> <OPTION SELECTED> ALL <OPTION> S-3 (Stock or Bond Registration) <OPTION> S-8 <OPTION> 8-12B <OPTION> 8-A12G <OPTION> 8-B12B <OPTION> 8-K (Current Event) <OPTION> 10-12B <OPTION> 10-C <OPTION> 10-K (Annual Report) <OPTION> 10-Q (Quarterly Report) <OPTION> 11-K <OPTION> 13F-E (Mutual Funds Holdings) <OPTION> 14 <OPTION> SC (SC 14D is >= 5% Acq) <OPTION> DEF (DEF 14A is the proxy) <OPTION> 424 <OPTION> 485 (Mutual Fund Prospectuses) </SELECT> <P> <i>Filing Type Override</i> <INPUT NAME="filing-override"> <p> <b>Limit </b> your search to a maximum? <SELECT NAME="limit-List"> <OPTION> no limit <OPTION> 1000 <OPTION> 500 <OPTION> 250 <OPTION SELECTED> 100 <OPTION> 50 <OPTION> 25 <OPTION> 10 <OPTION> 5 <OPTION> 1 </SELECT> <P> Turn <b> debugging </b> on? <SELECT NAME="debug-option"> <OPTION SELECTED> no debug <OPTION> debug </SELECT> <P> </FORM> <HR> <A HREF="http://edgar.stern.nyu.edu/comment-form.html"> <p> <img src="http://edgar.stern.nyu.edu/icons/redball.gif"> Send comments to the layout designer.</a> <p> <a href="http://edgar.stern.nyu.edu/docs/general.html"> <img src="http://edgar.stern.nyu.edu/art/t_scroll.gif"> More information on Filing Types</a>. <A HREF="http://edgar.stern.nyu.edu/EDGAR.html"> <p> <img src="http://edgar.stern.nyu.edu/icons/back.gif"> Return to our Home Page.</A> <A HREF="http://edgar.stern.nyu.edu/comment-form.html">

sortform.pl This script uses the UNIX sort command to present the filings in reverse chronological order (see Listing 21.21). It writes the temporary sort file to disk to assist in debugging.

Listing 21.21. The sortform.pl code.

#!/usr/local/bin/perl # # sortform.pl - based on formlynx.pl.b # # 5/95 Peter Leung: Modified script to output forms in reverse # chronological order. ###################################################################### # NYU Edgar Development Project ###################################################################### sub match{ require 'edgarlib'; require 'edgardate'; require 'cgi-lib.pl'; # require 'look.pl'; # for binary search require 'ctime.pl'; # &form_init; $Date=&ctime(time); # human-readable date and time from the timestamp. $sdate=chop($Date); # Print out a content-type for HTTP/1.0 compatibility # print "Content-type: text/html\n\n"; #This line will call method to change current date to Julian number and #assign it to variable '$julnum' # $julnum = &j_number(); # Get the input read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'}); # Split the name-value pairs @pairs = split(/&/, $buffer); # Create associate array foreach (@pairs) { ($key, $value) = split (/=/, $_); $value=&deweb($value); $form{$key} = $value; } # overwrite the selection of form selected by user, # if the user enters a special form not on the picklist. if ($form{'form'} eq "") { ($userform, @garbage) = split (/\s/, $form{'Form-Pick-List'}); } else { $userform = $form{'form'}; } $userform =~ tr/[a-z]/[A-Z]/; # make sure to convert to uppercase. $date = $form{'date-range'}; if($date eq "Now"){ $matchdate=&main'jtod($julnum); }elsif($date eq "Last Week"){ $matchdate=&main'jtod($julnum-7); }elsif($date eq "Last Two Weeks"){ $matchdate=&main'jtod($julnum-15); }elsif($date eq "Last Month"){ $matchdate=&main'jtod($julnum-30); }elsif($date eq "Last Three Months"){ $matchdate=&main'jtod($julnum-90); }elsif($date eq "Last Six Months"){ $matchdate=&main'jtod($julnum-180); }elsif($date eq "Last Nine Months"){ $matchdate=&main'jtod($julnum-270); }elsif($date eq "Last Year"){ $matchdate=&main'jtod($julnum-360); }else{ $matchdate=&main'jtod(2449354); } ($nday, $nmon, $nyear)=split(/-/,$matchdate); if($nmon < 10){ $nmon=join('','0',$nmon); } if($nday< 10){ $nday = join('','0',$nday); } $matchdate=join('',$nyear,$nmon,$nday); $forms="/web/research/master/forms.sorted"; # the edgar master index # # field layout: 0 = form 1 = company 2 = CIK 3 = date 4 = path # &main_head(); # output Report Header &field_head(); # output Field Headers for Report print"<HR>"; $hitctr = 0; # hit counter variable $temp="/web/profile/logs/formonly.temp"; open(TEMP, ">$temp") || die "Can't open the TEMP FILE"; open(COMPANY, $forms) || die "cannot open the INPUT FILE"; $logpath = "/web/profile/logs/"; $logname = "formonly.log"; $logfile = ">>$logpath$logname"; open(LOGFILE, $logfile) || die "problem opening Log file"; print LOGFILE "$Date | $ENV{REMOTE_HOST} | $ENV{REMOTE_ADDR} | $ENV{HTTP_USER_AGENT} | $userform | $matchdate \n"; close LOGFILE || die "problem closing logfile\n"; $colon = ":"; &look(*COMPANY, $userform,0,0); while (<COMPANY>){ last unless /^$userform.*/; # @line = split(/:/); if ((/$form[1]/i || $form[1] eq "ALL") &&($line[3] >= $matchdate)) { $pfx="A HREF=ftp://town.hall.org/"; $line[4] =~ tr/ //d; # get rid of the mysterious leading # space in the path. $colen=length($line[1]); $coadj=substr($line[1],1,$colen); $copath="\<".$pfx.$line[4].">".$coadj."</A>"; $company = "<A HREF=ftp://town.hall.org/$line[4]>$coadj</A>"; @date = split(//,$line[3]); $date = "$date[5]$date[6]-$date[7]$date[8]- $date[1]$date[2]$date[3]$date[4]"; $formtype = $line[0]; $hitctr++; $form_desc = &form_desc($formtype); print TEMP "$line[3] | $date | $formtype |$form_desc | $copath \n"; } # end of IF } # end of WHILE close(COMPANY) || die "cannot close the INPUT FILE"; # this is critical to # reset the line pointer; # else erratic. close(TEMP) || die "Cannot close the TEMP FILE"; $sorted = "/web/profile/logs/formonly.sorted"; if (-e $sorted) {'rm $sorted';} 'sort /web/profile/logs/formonly.temp -r -o /web/profile/logs/formonly.sorted'; &print_output; print "Your search had <b>$hitctr</b> hit(s).<p>"; print "<A HREF=\"http://edgar.stern.nyu.edu/formonly.html\"> Next</A> Form search!"; &trailer(); exit 1; } #end of "eval &match" sub main_head{ print "<body>"; print "<Title>Edgar Forms Search</Title>"; print "<H1>Edgar Forms Search</H1>"; print "<p>"; print "</body>"; } #end of main_head method sub field_head{ $fhdr="<B>FORMS</B>"; $chdr="<B>COMPANY NAME</B>"; $shdr="<B>FORM DESCRIPTION</B>"; $dhdr="<B>DATE FILED</B>"; print "<pre>"; printf(" %-10s %-10s %-10s %s",$dhdr, $fhdr, $shdr, $chdr); print "</pre>"; } sub j_number{ local($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime; $smon=$mon+1; $syear=$year+1900; $timedate=join('-',$mday,$smon,$syear); &main'dtoj($timedate); } sub form_desc { $local=$_[0]; $desc=$forms{$local}; if ($desc eq ""){ $desc = "Other"; } $desc; } sub print_output { $temps = "/web/profile/logs/formonly.sorted"; open(TEMP, $temps) || die "cannot open the sorted temp file for r-o"; while (<TEMP>) { @linet = split(/\|/,$_); # split on | character $foobar = $linet[2]; print "<pre>"; printf("%s %-12s %-21s %s",$linet[1],$linet[2],$linet[3],$linet[4]); print"</pre>"; } close(TEMP) || die "cannot close temp file r-o"; } sub trailer{ print "<HR>"; print "<a href=\"http://edgar.stern.nyu.edu/\">"; print "<img src=\"http://edgar.stern.nyu.edu/icons/back.gif\">"; print "Return to our home page</A>"; print "<HR>"; local($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime; local(@days) = ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'); local(@months) = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'); print "<p>\nGenerated by the <B>NYU Stern School</B> of Business"; print " on ", sprintf("%02d:%02d:%02d on %s %d, %d", $hour, $min, $sec, $months[$mon], $mday, 1900+$year); print "</body></html>\n"; } eval '&match'; exit 0; # it's nicer to exit with error code of 0.

edgarftp.pl This routine is useful for automating an FTP session with Perl; it adds important error routines to ftplib.pl, as shown in Listing 21.22.

Listing 21.22. The edgarftp.pl code.

# ------------------------------------------------------------ # edgarftp : package to interface w/sub's in ftplib.pl # purpose is to set up some standard $variables # and to add error-checking since ftplib.pl will report nothing # # subroutines in this package: # check_errors print error (if any) + closes ftp session # # get_user_info prompts for $userid + $password (displayed ON screen) # logon_to_server logs onto remote ftp server # needs $address $userid $password # logout closes ftp session # get_address prompts for $address # ftp_the_file gets $filename from remote server # goto_directory cds to $directory # get_directory puts name of all directories in the current working # directory and puts in @??? array # print_directory prints out @??? array # # each subroutine is called w/o parameters: put values in $variables first # ------------------------------------------------------------ package edftp; ## variables # $userid # $password # $address # $filename # $directory SUBS: { sub display_vars {print "address = $address\n"; print "userid = $userid\n"; print "password = $password\n"; print "filename = $filename\n"; print "directory = $directory\n"; } sub check_errors { $message = &ftp'error; if ($message) { print LOG "message = $message\n"; &logout; exit 1; } } sub get_user_info { print "Userid: "; $userid = <STDIN>; chop($userid); print "Password: "; $password = <STDIN>; chop($password); } sub logon_to_server { &ftp'open("$address","$userid","$password"); &check_errors; } sub logout { &ftp'close; &check_errors; } sub get_address { print "IP address: "; $address = <STDIN>; chop($address); } sub ftp_the_file { &ftp'get("$filename"); &check_errors;} sub goto_directory { &ftp'cwd("$directory"); &check_errors; } sub get_directory { @file = &ftp'dir; @file; } sub print_directory { $number=@file; $counter=0; while ($counter < $number) { $_ = $file[$counter]; if (/^d/) {print "$file[$counter]\n";} $counter++; } } } ## end of SUBS: 1; ## 'required for packages'

Gateway Programming Libraries and Databases Check

In a script that searches a flat file, the web developer should be able to use the technique of a binary search to maximize performance. All necessary preprocessing of the file(s) to be searched should be done off-line to accommodate efficient retrieval.
The developer should have a sense of when a flat file structure is not enough and a database engine should be used instead.
Both the Perl printf and format statements are useful ways to format CGI output.
Providing a debugging option in an HTML form is a good idea to help anticipate the problems that unusual browsers might cause.
The advantages of judicious use of subroutines (either written in-house or available in standard Perl packages) should be clear to the developer.
When interfacing Perl with another software package, such as a database engine, the developer first should become familiar with the package's command-line behavior. The Perl gateway can be simulated at first with command-line arguments and then iteratively debugged in the Web environment.

Footnotes

The InfoSeek search engine bills itself as "a comprehensive and accurate WWW search engine" and adds, "You can type your search in plain English or just enter key words and phrases." The InfoSeek site is at http://www.infoseek.com/. 1996 saw a burgeoning of search engine options; the popular Excite (Architext) is at http://www.excite.com, and we also now are seeing combined search plus Web site reviews at Magellan's site (http://magellan.mckinley.com/), for example.
A potpourri of information on accessing Sybase databases with Perl routines is located at http://www.adp.unc.edu/info/sybperl.html. Sybase has more information on the web.sql product at http://www.sybase.com/products/internet/websql/.
The EDGAR Internet project was a National Science Foundation-funded endeavor. The stated goals were "to enable wide dissemination and support all levels of user access to the corporate electronic filings submitted to the Securities and Exchange Commission (SEC), to identify and understand the requirements for broad public access, to identify and implement applications which operate on the large document database and synthesize reports based on information across multiple filings, and to understand patterns of access to the EDGAR database." Carl Malamud at the Internet Multicasting Service (http://www.town.hall.org) and Ajit Kambil at NYU's Stern School of Business, Information Systems Department, were the co-principal investigators. By any measure, the project was an unqualified success. The NYU Development Web site-now funded by Disclosure, Fame, RR Donnelley, and others-at http://edgar.stern.nyu.edu/ (which provides front-end tools to access the SEC filings), has experienced steady growth and now is serving more than 25,000 accesses every weekday. Further information is available from Ajit Kambil (akambil@stern.nyu.edu).
A binary search divides the file in half in every search iteration. If the user inputs a query starting with ccc, the million-record file is divided into two 500,000-record pieces. The higher half is discarded and the remaining 500,000-record half file further is subdivided until the key ccc is located. Because 2 to the 20th power is 1,048,576 and 2 to the 19th power is 524,288, I need at most 20 iterations to find the requested key.
The Zacks Investment Research Analyst Watch service is at http://aw.zacks.com/, and the Zacks Investor's Window is at http://iw.zacks.com/. Susan Krivulis stays busy as chief of operations of the Zacks Data Processing complex.
On-line Perl manual pages are perfect for the Web. One such site is http://rhine.ece.utexas.edu/~kschu/perlman.html. Metronet is a popular provider of Perl information for both Perl 4.036 and Perl 5. Its Perl 5 manual page is http://www.metronet.com/0/perlinfo/perl5/manual/perl.html.
Numerous on-line resources enable you to learn more about SQL. http://www.jcc.com/sql_stnd.html "is designed to be a central source of information about the SQL standards process and its current state"; http://waltz.ncsl.nist.gov/~len/sql_info.html is another excellent site. A good book on the 1992 SQL standard is Understanding the New SQL: A Complete Guide, by Jim Melton and Alan R. Simon, Morgan-Kaufmann Publishers, San Mateo, CA, 94403, USA.
The NCSA GSQL Toolkit Mosaic-SQL gateway is on-line at http://www.ncsa.uiuc.edu/SDG/People/jason/pub/gsql/starthere.html, and includes interesting Oracle and Sybase examples.
The Illustra home page (http://www.illustra.com/) provides more information about this company, founded in 1992. It writes, "Illustra represents the commercialization of the University of California's breakthrough POSTGRES database research project under the direction of Dr. Michael Stonebraker." In 1996, Illustra was acquired by the large database vendor Informix.

Chapter 21

Gateway Programming I: Programming Libraries and Databases

CONTENTS

form_init

nyu_trailer

html_header

home

Binary Searches

Binary Search Example I: A Simple Rolodex Application

Binary Search Example II: A Corporate Filings Lookup

Observations about formlynx.pl.b

Comments about zacksnew.pl

Code Discussion: zack_tick.pl

Company-to-Ticker-to-Filing Application Walkthrough

Code Observations: zack3.pl

Code Discussion: sql_wrapper.pl

Observations about sql_loss.pl

Database Tuning

Footnotes

`form_init`

`nyu_trailer`

`html_header`

`home`

Observations about `formlynx.pl.b`

Comments about `zacksnew.pl`

Code Discussion: `zack_tick.pl`

Code Observations: `zack3.pl`

Code Discussion: `sql_wrapper.pl`

Observations about `sql_loss.pl`