Chapter 22

Gateway Programming II: Text Search and Retrieval Tools


The need to query a large textual information source is a common one in the Information Age. Alexander the Great probably thought about it in his dream to build the great library at Alexandria, and in more modern times, the field of library science has evolved to study methods of indexing text resources, including efficiency and accuracy of search. In this chapter, I examine techniques to interface with both simple and extremely sophisticated text indexing and retrieval tools, and discuss strengths and weaknesses therein. In a distributed hypermedia space such as the Web, efficiency of search is a critical concern. The efficiency must be weighed against the constraints of network load between machines and processing loads placed on busy servers, however. There is the further constraint of physical disk storage to consider: If the indexing tool creates a large index relative to the underlying data, disk space soon might become scarce.

On the corporate intranet level, there is the question of whether a single central server fetching documents from various satellite servers can effectively index the firm's documents and provide links back to the document origins. One architectural solution to this problem is discussed in this chapter. This is the year of the great race toward distributed search processing, where agents can be launched to read satellite indexes instead of tying up the network with document fetches. We're not quite there yet, however, and must develop efficient centralized solutions in the interim.

The developer also must anticipate site-specific search requirements. Is it likely that end users will not know exactly how to spell keywords in the information store? In that case, a tool that provides approximation search (error tolerance) should be used. Is the site's information store constantly growing? Then it would be appropriate to update the index incrementally; in other words, ensure that the indexing tool supports incremental updates, which are much faster than rebuilding the whole index from scratch.

More fundamentally, the developer must appraise the information store provided at the site: Is it a heterogeneous archive (such as the data archive of SEC corporate filings) or is it tabular data, more appropriate for the database models discussed in Chapter 21?

If a text-indexing tool is chosen, it is both possible and desirable to collect empirical data on the additional server load imposed by the tool, average response time from the text search engine server to the end user's search terms, and accuracy of the response (did the answer suit the question?). As will become clear in this chapter, the subsystems that make up indexing and retrieval are highly modular and can be viewed as experimental tools at the developer's disposal. If one engine does not fit the bill (in terms of accuracy, response time, or resource usage), another one can be substituted.

Philosophies of Text Search and Retrieval on the Web

If no software packages existed to accommodate keyword search and retrieval on one or more Web documents, one do-it-yourself solution might be to construct a companion file of keywords for each document to be indexed. A simple Perl routine could then look up a user's keywords in the companion files and return the appropriate full-text documents.
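Here is a minimal sketch of that do-it-yourself approach, assuming a hypothetical companion file, keywords.txt, in which each line holds a keyword followed by the path of the full-text document it describes:

#!/usr/local/bin/perl
# keyword-lookup.pl - hypothetical companion-file lookup (a sketch, not a package)
# Assumed format of keywords.txt:   keyword   /path/to/document.html
$term = shift(@ARGV);                    # the search term from the command line
open(KEYS, "keywords.txt") || die "cannot open keywords.txt: $!";
while (<KEYS>) {
    chop;
    ($keyword, $doc) = split(/\s+/, $_, 2);
    print "$doc\n" if ($keyword =~ /^$term$/i);   # case-insensitive exact match
}
close(KEYS);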

Storing keywords as plain text is a waste of disk space, however. Powerful indexing engines exist to create highly compressed keyword files, and easy-to-use interface tools exist to provide the glue between the back-end indexes and the front-end Web page, which is how the user interacts with the tool.

Choosing an indexing tool on a UNIX workstation is a pleasure because several high-quality packages are freely available on the Internet and are enhanced quite often. Proprietary alternatives, as a rule of thumb, involve lengthier cycle times between vendor releases.

Before I discuss indexing tools, though, I should mention an interesting alternative that completely bypasses the indexing step: Oscar Nierstrasz's htgrep Perl package, which enables you to retrieve keywords from a single HTML document.

Htgrep works best when the HTML document is a large one; this package allows the user to enter a Perl regular expression, and the result is shown, by default, on a paragraph-by-paragraph basis. I elected to test the package on the Abiomed Corporate Annual Report disclosure document, which I had previously HTMLized with a Perl conversion routine.

Figure 22.1 shows the htgrep Perl regular expression search window, which acts on the file abiomed.html.

Figure 22.1 : The htgrep initial query screen

Note that the URL referenced in Figure 22.1 is http://edgar.stern.nyu.edu/mgbin/htgrep/file=abiomed.html.

The htgrep package is smart enough to recognize that no keyword query string is present (it would be of the form file=filename?keyword), and it reacts appropriately, showing a textbox and prompting for user input.

The user elects to search on the character string Angioflex, and the results are shown in Figure 22.2.

Figure 22.2 : The htgrep query results from the Angioflex keyword search on abiomed.html.

The results are shown as two paragraphs; these are the two HTML blocks where the Angioflex keyword occurred. Htgrep's behavior can be modified with the use of tags in the URL. For example, if I want a line-by-line output instead of the default paragraph blocks, I want a maximum of 250 returned hits, and I want to display a custom HTML header file, welcome.html, before the query output, I can specify the following URL:

http://edgar.stern.nyu.edu/mgbin/htgrep/file=abiomed.html&linemode=
yes&max=250&hdr=welcome.html?Angioflex

Figure 22.3 shows the results of this text search.

Figure 22.3 : The htgrep package can be modified with PATH_INFO tag qualifiers.

Note how the welcome.html header has the effect of suppressing the input box for another keyword search. The reader is referred to the cuiwww.unige.ch Web site, where htgrep is put to good use as a front-end to query the mammoth unige-pages.html document on that server.

The majority of web developers, however, will want to index batches of files at their sites in one pass and be able to tell the indexing engine what files and directories to exclude. Simple Web Indexing System for Humans (SWISH), an ANSI C program written by Kevin Hughes at Enterprise Integration Technologies (EIT), is one package that offers this power and ease of use. Installation is simple (the documentation is on-line at http://www.eit.com/software/swish/swish.html), and the program is ideally suited for indexing entire web sites. If all the HTML files exist under one directory, the SWISH indexing can be accomplished in one pass; a single file, index.swish, is created, which is convenient when the index needs to be replicated or moved. SWISH has Web-aware properties, such as giving higher relevance to keywords found in titles and headers. In addition, SWISH (as well as htgrep), has the capability to create hyperlinks from the query results. Kevin has written a companion ANSI C gateway program, WWWWais, which also is quite easy to configure, install, and use. WWWWais is the front end for SWISH indexes as well as the more complex WAIS indexes; documentation is available at http://www.eit.com/software/wwwwais/wwwwais.html. SWISH does not have word-stemming capabilities or synonym dictionaries; for that, you need a more complex package, such as WAIS.

Introduction to WAIS

Wide Area Information Servers (WAIS) is based on the ANSI Z39.50 Information Retrieval Service and Protocol standard that was approved in 1988.(See note) At the most basic level, WAIS and WAIS-like products have two major components:

An indexing engine  This takes a textual archive (it is not necessary to label this a database, although the term database often is used as a substitute for any large data store) and creates an index.

A query engine  This uses the WAIS index to handle ad-hoc queries and returns hits against the index.

The Z39.50 protocol can be used with any searchable data; it is a popular scheme to index on-line library catalogs. WAIS follows the client/server model. The client, or information requester, poses a question that is syntactically parsed by the WAIS server. If the question is understood, the server searches the relevant index or indexes and reports results.

A freely available implementation of the Z39.50 standard is maintained by the Center for Networked Information Discovery and Retrieval (CNIDR).(See note) Until recently, it was known as freeWAIS but then changed its name to ZDist.(See note) Release 1.02 includes a UNIX client, server, HTTP-to-Z39.50 gateway, and an e-mail-to-Z39.50 gateway. ZDist has been ported to the UNIX flavors Sun OS, Ultrix, and OSF. Another flavor of freely available WAIS is freeWAIS-sf (which supports fielded search); I discuss this package later in the chapter.

The concept of a Wide-Area Information Search is a powerful one in the Web environment. With a WAIS server, a content provider can index a database and make it available for searching on the Internet. The commercial concern WAIS, Inc. provides specialized software tools to parse unusual database formats to extract an index of its content as well as a specialized HTTP-WAIS gateway.(See note)

In the Web, results from a WAIS (or, more generally, any Z39.50-compliant) query can be hyperlinked to a base document. The WAIS engine also calculates a relevancy score for each index hit based on several factors, including number of occurrences of the keyword(s), proximity of the keywords to each other, and closeness of the keywords to the top of the document. Of course, relevancy scores might be misleading in certain situations. Suppose that I'm searching the SEC Filings archive for the keyword Citicorp (with the simple goal of finding Citicorp filings and learning more about its business). Some of the documents with the highest Citicorp relevancy score might be totally unrelated to Citicorp's business; for example, a company using Citicorp as a financing agent might mention the company dozens of times in the legal boilerplate.

Using WAIS to Index a Data Archive for the Web

To prepare a WAIS index of a data archive for Web consumption, it is necessary to understand the behavior of the waisindex command. Furthermore, if developers want to index HTML documents, additional indexing options enable them to transform the answers from the WAIS server into hotlinks pointing to the original source documents.

Using WAIS or any of its cousins (freeWAIS or freeWAIS-sf), the general waisindex command follows:

waisindex -export -d ~/my-wais-dir/my-wais-file -T FILE-TYPE *.extension

The -export flag tells waisindex to create an index from scratch (in this case, the WAIS index my-wais-file). The file type following the -T flag is ad hoc and arbitrary. By convention, TEXT is used for ASCII text, PS for PostScript, and GIF for GIF images. For example, I can use

waisindex -export -d ~/wais/source/psidx -T PS *.ps

to index only files with the .ps extension-presumably, PostScript documents. Similarly, text files can be indexed with this command:

waisindex -export -d ~/wais/source/textidx -T TEXT *.txt

After a WAIS index is created (which can be a lengthy process for a large collection of files), however, it would be a mistake to re-index from scratch when new entries appear. Instead, I would use the following command to incrementally update the WAIS text index that I created with the last command:

waisindex -a -d ~/wais/source/textidx -T TEXT *.txt

Naturally, these commands can be placed within simple shell scripts or Perl scripts to build the indexes painlessly.
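For example, a small Perl wrapper along the following lines could choose between a fresh build and an incremental update; the index path, the file pattern, and the test on the .dct dictionary file are placeholders for whatever the site actually uses:

#!/usr/local/bin/perl
# build-index.pl - hypothetical wrapper around waisindex
$index = "$ENV{'HOME'}/wais/source/textidx";   # placeholder index path
$files = "*.txt";                              # placeholder file pattern
# If the index dictionary already exists, append to it; otherwise build from scratch.
if (-f "$index.dct") {
    system("waisindex -a -d $index -T TEXT $files");
} else {
    system("waisindex -export -d $index -T TEXT $files");
}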

When Marc Andreessen was at the NCSA (that is, before Netscape), he wrote a WAIS and Mosaic tutorial that still is useful (at the URL http://hoohoo.ncsa.uiuc.edu/Mosaic/wais-tutorial/wais.html); here is what he wrote about the relationship of WAIS file types to NCSA Mosaic's MIME scheme:

…a WAIS type retrieved as the result of a query is matched to a MIME type as though it were a file extension. In other words, because a file with extension .text normally is considered plain text (MIME type text/plain) by Mosaic, a WAIS query result of WAIS type TEXT also is considered text/plain. Similarly, if Mosaic were configured to recognize file extension .foo as MIME type application/x-foo, a WAIS query result of WAIS type FOO also would be considered of type application/x-foo.

An Overview of MIME Types and the mailcap File
UNIX users will find this overview useful. On most UNIX boxes and with most Web clients, the standard MIME types that a Web client understands (for example, GIF and JPEG images, HTML-formatted documents) can be extended in two ways. First, users can edit their .mailcap and .mime.types files (which live in the home directory). Second, the system administrator can alter system-wide mailcap and mime.types files. The configuration to recognize *.foo that Marc mentions could have been accomplished by either method. Each Web browser should come with documentation on where the default system-wide configuration files reside. To better understand the possibilities to extend the base MIME types, you can review my personal .mailcap file that follows:
audio/*; showaudio %s
application/pdf; acroread %s
application/x-pgn; xboard -ncp -lgf %s
application/x-chess-pgn; xboard -ncp -lgf %s
application/x-fen; xboard -ncp -lpf %s
application/jpg; xv %s
video/*; xanim %s
The MIME type and subtype are on the left, and following the semicolons are the programs corresponding to that type. The video type, for example, no matter what its subtype, fires up the xanim program. The %s is a parameter that is filled by the actual file as it is brought across the Internet. Subtypes prefixed with x- should be interpreted as experimental; the x- is not actually part of the name. Hence, fen (a chess game recorded in Forsythe-Edwards chess notation) is an experimental file type that causes Tim Mann's xboard program to start and play the fen file over.(See note)
The companion file to the .mailcap is the .mime.types. Here is mine:
application/x-pgn      pgn
application/x-fen      fen
application/x-foo      foo
application/x-chess-pgn pgn
application/pdf        pdf
audio/au               au
image/jpg              jpg jpeg
Now it is clearer why the x- is not part of the extension; the .mime.types map MIME file types to physical extensions. Hence, both *.jpg and *.jpeg on disk are understood to be of type Image and subtype jpg.

Forms-Based WAIS Query Examples

The Internet Multicasting Service in its initial EDGAR server support used the industrial-strength commercial WAIS engine in its WAIS search of the SEC EDGAR filings.

Figure 22.4 shows the result of a user searching for Sun and Microsystems.

Figure 22.4 : The WAIS query results from Sun and Microsystems. Note that the relevancy score is not displayed by this gateway.

The Standard wais.pl Interface

The standard NCSA httpd distribution ships with Tony Sanders' Perl interface to WAIS, wais.pl, as Listing 22.1 shows.


Listing 22.1. The wais.pl WAIS search interface.
#!/usr/local/bin/perl
#
# wais.pl - WAIS search interface
#
# wais.pl,v 1.2 1994/04/10 05:33:29 robm Exp
#
# Tony Sanders <sanders@bsdi.com>, Nov 1993
#
# Example configuration (in local.conf):
#     map topdir wais.pl &do_wais($top, $path, $query, "database", "title")
#

$waisq = "/usr/local/bin/waisq";
$waisd = "/u/Web/wais-sources";
$src = "www";
$title = "NCSA httpd documentation";

sub send_index {
    print "Content-type: text/html\n\n";
    print "<HEAD>\n<TITLE>Index of ", $title, "</TITLE>\n</HEAD>\n";
    print "<BODY>\n<H1>", $title, "</H1>\n";
    print "This is an index of the information on this server. Please\n";
    print "type a query in the search dialog.\n<P>";
    print "You may use compound searches, such as: <CODE>environment AND cgi</CODE>\n";
    print "<ISINDEX>";
}

sub do_wais {
#    local($top, $path, $query, $src, $title) = @_;

    do { &'send_index; return; } unless defined @ARGV;
    local(@query) = @ARGV;
    local($pquery) = join(" ", @query);

    print "Content-type: text/html\n\n";

    open(WAISQ, "-|") || exec ($waisq, "-c", $waisd,
                               "-f", "-", "-S", "$src.src", "-g", @query);
    print "<HEAD>\n<TITLE>Search of ", $title, "</TITLE>\n</HEAD>\n";
    print "<BODY>\n<H1>", $title, "</H1>\n";
    print "Index \`$src\' contains the following\n";
    print "items relevant to \`$pquery\':<P>\n";
    print "<DL>\n";

    local($hits, $score, $headline, $lines, $bytes, $type, $date);
    while (<WAISQ>) {
        /:score\s+(\d+)/ && ($score = $1);
        /:number-of-lines\s+(\d+)/ && ($lines = $1);
        /:number-of-bytes\s+(\d+)/ && ($bytes = $1);
        /:type "(.*)"/ && ($type = $1);
        /:headline "(.*)"/ && ($headline = $1);         # XXX
        /:date "(\d+)"/ && ($date = $1, $hits++, &docdone);
    }
    close(WAISQ);
    print "</DL>\n";

    if ($hits == 0) {
        print "Nothing found.\n";
    }
    print "</BODY>\n";
}

sub docdone {
    if ($headline =~ /Search produced no result/) {
        print "<HR>";
        print $headline, "<P>\n<PRE>";
        # the following was &'safeopen
        open(WAISCAT, "$waisd/$src.cat") || die "$src.cat: $!";
        while (<WAISCAT>) {
            s#(Catalog for database:)\s+.*#$1
<A HREF="/$top/$src.src">$src.src</A>#;
            s#Headline:\s+(.*)#Headline: <A HREF="$1">$1</A>#;
            print;
        }
        close(WAISCAT);
        print "\n</PRE>\n";
    } else {
        print "<DT><A HREF=\"$headline\">$headline</A>\n";
        print "<DD>Score: $score, Lines: $lines, Bytes: $bytes\n";
    }
    $score = $headline = $lines = $bytes = $type = $date = '';
}

open (STDERR,"> /dev/null");
eval '&do_wais';

Some Observations about wais.pl

The code line

open(WAISQ, "-|") || exec ($waisq, "-c", $waisd, "-f", "-", "-S",
"$src.src", "-g", @query);

is full of action.

The -| mode of open does an implicit fork and opens a pipe to the child process; input to the file handle WAISQ is piped from the stdout of the waisq process.

Thus, the WAISQ file handle fills with the answer to the WAIS query and the returned elements are massaged for cosmetic presentation. Note in particular the $score variable, which contains the relevancy score.
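The same open-and-exec idiom can be used to capture the output of any external command. Here is a minimal stand-alone sketch (the command and its arguments are arbitrary examples, not part of wais.pl):

#!/usr/local/bin/perl
# Fork a child whose standard output is connected to the CHILD file handle.
open(CHILD, "-|") || exec("/bin/ls", "-l");   # the child runs the command
while (<CHILD>) {
    print "got: $_";                          # the parent reads the output line by line
}
close(CHILD);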

Debugging the WAIS Interface

Taking a step back to first principles, the developer first should experiment with waisq on the command line, constructing queries of the general form

$waisq, "-c", $waisd,"-f", "-", "-S", "$src.src", "-g", @query

by substituting concrete examples in place of the $ variables. The UNIX Shell command will be in this general form:

/usr/local/bin/waisq -c ~/my_wais_index -f - -S my_wais.src -g keyword1 keyword2 ...

If WAIS does not behave on the command line, don't panic yet. Commercial WAIS differs from freeWAIS; freeWAIS differs from another variant that I discuss shortly: freeWAIS-sf. The developer should make a habit of consulting the documentation and on-line manual pages. If the site has a WAIS or WAIS-like engine that behaves a little differently than the Perl gateway would like, by all means, my advice is to hack the gateway to smooth things out. Usually, minor tweaking of a slightly misbehaving gateway offers a modicum of amusement and should not be too much of a time sink.

Another Way to Ask a WAIS Query

The command waissearch is another way to ask a WAIS query. On the command line, the usage for waissearch follows:

Usage: waissearch
[-h host-machine]    /* defaults to localhost */
[-p service-or-port] /* defaults to z39_50 */
[-d database]        /* defaults to nil */
[-m maximum_results] /* defaults to 40 */
[-v]                 /* print the version */
word word...

For example,

waissearch -p 210 -d my_wais_index -m 10 keyword1

searches my_wais_index and returns a maximum of 10 hits on keyword1.

Listing 22.2 shows a simple waissearch.pl interface that I wrote to scan an index of Corporate Proxies (Filing DEF 14A) for the keyword stock. I'm using the freeWAIS-sf index and query software (discussed later), but the general spirit of things is the same for all WAIS-like engines.


Listing 22.2. A waissearch.pl interface.
#!/usr/local/bin/perl
#
# waissearch.pl :  primitive waissearch interface
#
# using freeWAIS-sf
#
# Mark Ginsburg 5/95
#
#

require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl';
&html_header("Waissearch Gateway Demo");
$hit = 0;
@answer = `/usr/local/bin/waissearch -p 210 -d DEF-14A.src stock < /etc/null`;
print "<HEAD>\n<TITLE>Search of DEF-14A </TITLE>\n</HEAD>\n";
print "<BODY>\n<H1> Search of DEF-14A </H1>\n";

print "Index DEF-14A contains the following\n";
print "items relevant to stock:<P>\n";
print "<DL>\n";

foreach $elem (@answer)  {
print "$elem\n";
}
print "</BODY>\n";
exit 0;

Observations about waissearch.pl

I presupplied the word stock to simplify the example; of course, I could have passed it to the script via an HTML form.

Notice this line:

@answer = `/usr/local/bin/waissearch -p 210 -d DEF-14A.src stock < /etc/null`;

Why am I redirecting the file /etc/null to the waissearch command's stdin? Because, if I type

/usr/local/bin/waissearch -p 210 -d DEF-14A.src stock

on the command line, I get this result:

Search Response:
NumberOfRecordsReturned: 40
   1: Score:   77, lines: 208 'BOEING_CO.extr.14DEF.1.html'
   2: Score:   70, lines: 438 'RITE_AID_CORP.extr.14DEF.1.html'
   3: Score:   61, lines: 722 'FEDERAL_EXPRESS_CORP.extr.14DEF.1.html'
   4: Score:   61, lines: 411 'MGM_GRAND_INC.extr.14DEF.1.html'
   5: Score:   60, lines: 936 'QVC_NETWORK_INC.extr.14DEF.1.html'
   6: Score:   58, lines: 288 'JACOBSON_STORES_INC.extr.14DEF.1.html'
   7: Score:   57, lines: 441 'BELL_ATLANTIC_CORP.extr.14DEF.1.html'
   8: Score:   57, lines: 483 'MGM_GRAND_INC.extr.14DEF.2.html'
   9: Score:   55, lines:1069 'COLGATE_PALMOLIVE_CO.extr.14DEF.1.html'
  10: Score:   54, lines: 321 'BELL_ATLANTIC_CORP.extr.14DEF.2.html'
  11: Score:   54, lines: 787 'COLGATE_PALMOLIVE_CO.extr.14DEF.2.html'
  12: Score:   54, lines: 687 'GAP_INC.extr.14DEF.1.html'

  [...]

  36: Score:   46, lines: 807 'INTEL_CORP.extr.14DEF.2.html'
  37: Score:   46, lines: 744 'DISNEY_WALT_CO.extr.14DEF.1.html'
  38: Score:   44, lines: 748 'PFIZER_INC.extr.14DEF.1.html'
  39: Score:   43, lines: 670 'ITT_CORP.extr.14DEF.1.html'
  40: Score:   43, lines:1158 'GOODYEAR_TIRE_AND_RUBB.extr.14DEF.1.html'
View document number [type 0 or q to quit]:

Note that the WAIS server is asking me a question now. I have to anticipate this question in the script and pipe in a q. After typing q on the command line, the server comes back with this:

Search for new words [type q to quit]:

Now I need to feed it a second q! The second q does the trick and I return to the command line; that is, the WAIS server stops the session.

The moral of the story is, when a developer is trying to design a new gateway, it is imperative to pay strict attention to the command-line behavior of the package.

So, as you might divine, the file /etc/null looks like this:

edgar{mark}% cat /etc/null
q
q

Figure 22.5 shows that things still are pretty raw, but the query works. All that remains is cosmetic mop-up of the output. In fact, it is not hard to wrap HTML hotlinks around the file names, and I show this in the next example.

Figure 22.5 : "Dirty" output from a prototype waissearch gateway script.
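For the record, the cosmetic mop-up can be as simple as one substitution per answer line. The following sketch assumes a hypothetical $baseurl prefix under which the returned file names are actually served; it is not part of the waissearch distribution:

foreach $elem (@answer) {
    # Wrap a hotlink around file names such as 'BOEING_CO.extr.14DEF.1.html'.
    $elem =~ s#'([^']+\.html)'#<A HREF="$baseurl/$1">$1</A>#;
    print "$elem\n";
}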

freeWAIS-sf

Ulrich Pfeiffer's freeWAIS-sf represents an experimental extension to CNIDR's freeWAIS (now known as ZDist).(See note) The sf stands for structured fields.

The most important extension of freeWAIS-sf is the new capability of the data archive administrator to create a data format file before indexing. Listing 22.3 shows a format file, 10-k.fmt, that describes the structure of a few fields of the annual 10-K corporate report.


Listing 22.3. A freeWAIS-sf format file, 10-K.fmt.
<record-end> /<P>/
<field> /CONFORMED NAME:/
ccn TEXT BOTH
<end> /CENTRAL/

<field> /INDEX KEY:/
cik TEXT BOTH
<end> /STANDARD/

<field> /IAL CLASSIFICATION:/
sic TEXT BOTH
<end> /IRS/

The required <record-end> tag specifies what character separates multiple 10-Ks in the same file (a situation that does not occur in the EDGAR archive).

The <field> tags are of the general form

<field>  /regexp-start/
field-name  data-type   dictionary
   <end>    /regexp-end/

Therefore, the preceding format definition names the ccn, cik, and sic fields and assigns regular expressions at their start and end. The keyword TEXT declares these index fields to be of TEXT index type (SOUNDEX phonetic type is another possibility). Interestingly, the freeWAIS-sf flavor of waisindex creates an inverted index for each field specified in the format file. On the query side, the users can limit their searches to certain keyword(s), thus drastically reducing execution time.

The regular expressions are a simple and powerful way to delimit fields, and fields may overlap. Phonetic coding can be enabled on a field-by-field basis to permit "sounds-like" searching. The indexing engine automatically creates a "global" field-for use if the client omits specific named fields in a query.

The companion file of a *.fmt (format) file in freeWAIS-sf is a field-definition file, or *.fde file. Here is a sample 10-k.fde file:

ccn  Company Conformed Name
cik  Central Index Key
sic  Standard Industrial Classification

This file gives a full description of each field.

A sample fielded query might look like this:

ccn=(digital AND equipment)

This limits the search to the inverted index built on the ccn field, which had regular expressions delimiting its start and end as specified previously.

Building a freeWAIS-sf WAIS Index: HTML Extensions

The freeWAIS-sf package offers interesting options to the web developer who needs to index a set of HTML documents. Consider the following command and compare it to the more generic waisindex command discussed earlier:

waisindex -export -d /web/wais/source -t URL /web/profile/auto/extracts
http://edgar.stern.nyu.edu/ptest/auto/extracts *.html

A few of the similar features are recognizable immediately: the -export flag tells waisindex this will be a new index, created from scratch. And the *.html parameter at the end of the command limits the eligible files to the HTML extension.

The mysterious -t URL /web/profile/auto/extracts http://edgar.stern.nyu.edu/ptest/auto/extracts warrants more investigation, however.

The first parameter to the -t URL flag, /web/profile/auto/extracts, is the directory to strip from results generated by later queries to the WAIS server. The second parameter, http://edgar.stern.nyu.edu/ptest/auto/extracts, is the directory to prepend to the query response. The astute reader might notice the rationale for these two parameters: freeWAIS-sf is giving the user a chance to strip off unwanted directory information and then add in a customized prefix string to construct a valid HTML tag.

With a little experimentation, the web developer should be able to manipulate these two parameters to form valid HTML. To test them, just run waisindex against a small group of HTML files, use a freeWAIS gateway (the one supplied with the software or some variation thereof), and supply a keyword that is known to be in one or more of the documents. The answer(s) should come back as HTML hotlinks. If something is broken, the faulty link is easy to debug. After waisindex is running smoothly, it can be embedded inside a script; more powerfully, such a script can descend and process subdirectories recursively.

Consider the directory structure shown in Listing 22.4, which starts at the NYU Corporate Extract directory, /web/profile/auto/extracts.


Listing 22.4. A directory structure to be indexed.

3COM_CORP/               INTEL_CORP/
AMRE_INC/                ITT_CORP/
ANALOGIC_CORP/           JACOBSON_STORES_INC/
APPLE_COMPUTER_INC/      JOHNSON_AND_JOHNSON/
A_L_LABORATORIES_INC/    LILLY_ELI_AND_CO/
BANKAMERICA_CORP/        LIMITED_INC/
BELL_ATLANTIC_CORP/      MARTIN_MARIETTA_CORP/
BLACK_AND_DECKER_CORP/   MAYFLOWER_GROUP_INC/
BOEING_CO/               MCCAW_CELLULAR_COMMU/
CITICORP/                MCDONALDS_CORP/
COLGATE_PALMOLIVE_CO/    MCDONNELL_DOUGLAS_CO/
COMPAQ_COMPUTER_CORP/    MCGRAW_HILL_INC/
DEERE_AND_CO/            MCI_COMMUNICATIONS_C/
DELL_COMPUTER_CORP/      MGM_GRAND_INC/
DEXTER_CORP/             MICROSOFT_CORP/
DISCOVER_CREDIT_CORP/    MOTOROLA_INC/
DISNEY_WALT_CO/          NOVELL_INC/
DONNELLEY_R_R_AND_SONS/  PFIZER_INC/
DREYFUS_A_BONDS_PLUS_INC PHELPS_DODGE_CORP/
EXXON_CORP/              PHILIP_MORRIS_COMPAN/
FEDERAL_EXPRESS_CORP/    PROCTER_AND_GAMBLE_CO/
FORD_CREDIT_1993-A_G/    QVC_NETWORK_INC/
GAP_INC/                 RITE_AID_CORP/
GETTY_PETROLEUM_CORP/    SMITHFIELD_FOODS_INC/
GILLETTE_CO/             SUN_CO_INC/
GOODYEAR_TIRE_AND_RUBB/  SUN_DISTRIBUTORS_L_P/
HECHINGER_CO/            TYSON_FOODS_INC/
HEINZ_H_J_CO/            UNISYS_CORP/
HERTZ_CORP/              UPJOHN_CO/
HEWLETT_PACKARD_CO/      USX_CAPITAL_LLC/
HILTON_HOTELS_CORP/      XEROX_CORP/
IBM_CREDIT_CORP/         ZENITH_ELECTRONICS_C/
IBP_INC/


Below each company, there may or may not exist a subdirectory to house certain corporate filings. The directory ZENITH_ELECTRONICS_C, for example, contains these subdirectories:

10-K/  S-3/  8-K/  DEF14-A/

These represent annual reports, stock or bond registrations, current events, or proxies, respectively.

The problem then becomes how to write a shell script to call waisindex appropriately and how to navigate the directory structure starting at the top of the extract tree, /web/profile/auto/extracts.

Fortunately, Perl is strong at directory navigation.

Listing 22.5 shows the code for wais-8k.pl, a Perl script to recursively scan all corporate profile directories for 8-K filings (Current Events) and create the 8-K WAIS index (if it did not exist previously) or append to it (if it already exists).


Listing 22.5. The wais-8k.pl code.
#!/usr/local/bin/perl
#  author:  Oleg Kostko oleg@edgar.stern.nyu.edu
# Initialize path variables.
$path="/web/profile/auto/extracts";
$indexpath="/usr/local/edgar/web/wais-sf/8-K";

main: {
print "Please check if the following path variables are correct:\n";
print "Path to the .html files: $path \n";
print "Path to the index files: $indexpath \n";
print "   Enter y/n : ";
if ( <STDIN> =~ /y|Y/ ) {
&index_files();
}
else {
print "Change the variables in the script.\n Thank you and Goodbye. \n";
}
exit 0;
}

sub index_files {
local($company,$count,@dirs,@forms);

# Initialize var $count to 0 if no index files exist
# or 1 if index files have been created.
if (-f "$indexpath.dct")     { $count=1; }
else { $count=0; }
print "Working in dir $path.\n";
opendir(CUR,"$path") || die "Cannot open dir $path";
@dirs=readdir(CUR);
foreach $company (@dirs) {
if ($company =~ /\./) { next; }
else { chdir("$path/$company/8-K") || next; }
if ($count == 0) {
    `waisindex -export -d $indexpath -t URL /web/profile/auto/extracts http://edgar.stern.nyu.edu/ptest/auto/extracts *.html`;
    $count++;
}
else {
    `waisindex -a -d $indexpath -t URL /web/profile/auto/extracts http://edgar.stern.nyu.edu/ptest/auto/extracts *.html`;
    $count++;
} # end else
} # end foreach
print "Closing dir $path.\n";
} # end sub

Code Discussion: wais-8k.pl

Short but sweet, Oleg Kostko's program soars and dives among the nests of directories, scooping out only the 8-K filings and WAIS indexing them. This script easily can be adapted to other situations involving a top-level directory and nests of subdirectories. The waisindex invocation strips off the physical path of the files to be indexed (which should not appear in the HTML hotlinks of a query answer) and then prepends the virtual path as the httpd server knows it.

The NYU EDGAR interface offers a Corporate Profile Keyword Service that is based on freeWAIS-sf. I do not take advantage of the fielded search in this example (actually, Boolean searching on structured fields is not fully developed yet in freeWAIS-sf); however, I do use a Web-freeWAIS-sf gateway (SFgate), which is provided as part of the freeWAIS-sf distribution. Caveat: SFgate is very much a work in progress and does not use standard waisq or waissearch calls to interface with WAIS.

Figure 22.6 demonstrates the Corporate Profile Keyword Service using Ulrich Pfeiffer's SFgate.

Figure 22.6 : A freeWAIS-sf interface, where the user can enter keywords and choose filing types of interest.

Pros and Cons of WAIS and WAIS-Like Packages

Commercial WAIS and CNIDR's freeWAIS (now ZDist) are powerful indexing and query engines. The strong point is the obvious fit between the client/server model of WAIS server and WAIS client and the client/server model of Web information requester and Web information provider. A WAIS query might span an immense amount of cataloged library information in one or more indexes. The downside is that a WAIS index is expensive on the disk: about 1-to-1 with the file it indexes. A back-of-the-envelope calculation might steer sites away from big indexes if there are disk-storage constraints. Still, the ZDist distribution at CNIDR is promising because ports to various platforms and features continue to be added.

freeWAIS-sf is an intriguing package, but I had difficulty deciphering the often cryptic source code and arcane comments. The project is quite clearly a messy work in progress. The SFgate is a reasonable interface, but it does not use the standard waisq function call shown in Tony Sanders' wais.pl earlier in this chapter. This omission makes SFgate much harder to debug; I did not get Boolean searches on a multiple fielded search to behave consistently (that is, I got unexpected results from various Boolean permutations). I look forward to future releases of freeWAIS-sf.

Optimally, the fielded query extensions and phonetic searching in freeWAIS-sf will filter back to CNIDR for a best-of-both-worlds scenario.

Spiders and Robots: Distributed Information Retrieval

At the simplest level, a robot script should be able to start at a base URL that can be presupplied or filled in at runtime by a user and perform the following actions:

Retrieve the contents of the base URL.
Parse the contents of the base URL and detect all the hyperlinks that it references.
Follow the hyperlinks thus acquired and parse their contents, and so on, ad infinitum.

It's the ad infinitum phrase that can prove troubling to network administrators. It is safer to limit the script by

Restricting it to a certain set of Internet domains
Stopping it after it fetches a predefined maximum number of hyperlinks

Take a look at a simple robot application that prompts the user for the base URL and halts after all hyperlinks are found or after the first 100 are collected, whichever comes first. The program uses the HTTP content-fetching routine http_get.c, which was presented in Chapter 20, "Gateway Programming Fundamentals."

Figure 22.7 shows the initial screen where the base URL is specified.

Figure 22.7 : A base URL is input and the Web spider starts crawling to fetch the first 100 links.

The output, the first 100 links, is in tabular format, as shown in Figure 22.8.

Figure 22.8 : The hyperlinks are listed along with their depth level.

The output depth level is calculated by taking the base URL to be at level 0. All links at level 1 are those found within the base URL. All links at level 2 are those referenced by level-1 links, and so on. This choice of formatting was arbitrary and could have been accomplished with a set of nested bulleted lists, for example.

Now take a look at the spider code shown in Listing 22.6, originally authored by Cyrus Lowe as an advanced software class project, which created the output shown in Figure 22.9.

Figure 22.9 : The first 100 hyperlinks are listed and the robot.cgi program ends. The depth of each link is displayed as well.


Listing 22.6. The robot.cgi code to fetch the first 100 links.
#!/usr/local/bin/perl
#  robot.cgi:  constructs a table of the first 100 links
#              found starting with the user-supplied base URL
#  original author:  Cyrus Lowe, modifications by Mark Ginsburg
require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl';
## get first url
## repeat
##    fetch url content
##    collect url (content,parent url)
##    increment trailer
##    if trailer = counter then exit
##    else get next url
## end repeat
## display the list

$/="";   # for multi-line searching.  Camel, page 113.
$*=1;    # for multi-line searching.  Camel, page 114.

&parse_request;
$query = $query{"baseurl"};

$bogus = 999999;  # bogus url indicator value
$maxcount = 101;  # maximum # links
%list = (); # initialize url list array
@queue = (); # initialize url queue
$counter = $level = $trailer = 0; #initialize url counter and level

&get_first_url;
until ($trailer >= $counter || $counter >= $maxcount) {
&collect_url($queue[$trailer]);
$trailer++;
} # end of until

&results(%list);
exit 0;

####################################################
#  subroutines
###################### get_first_url ###############
sub get_first_url {

$url = $query;
$queue[0] = $url;
$list{$url} = 0;
$counter++;

} # end of get_first_url
###############  collect_url ##################
sub collect_url {
$url = $_[0];
open (HTML,"/class/mginsbur/bin/http_get $url|") || die "cannot open $url";
while (<HTML>) {
if ( /URL Not found/i ) {
print "$url not found";
$list{$url} = $bogus; # assign bogus to the url counter
last; # break out of while loop
        }
&build_url_list($url, $_);
if ($counter >= $maxcount) { last; }
    }
} # end of collect_url
################### test_list_and_queue ###################
sub test_list_and_queue {
foreach $i (@queue) {
print "$i\n";
    }

foreach $i (keys %list) {
print "Key: $i, Level: $list{$i}\n";
    }
} # end of test_list_and_queue
##################### build_url_list ########################
sub build_url_list {
$parent = $_[0];
$line = $_[1];
@line2 = split (/a\shref="http:\/\//i,$line); # break up the line
@line3 = split(/"/, $line2[1]);
if ($line3[0] ne "" && $counter < $maxcount) {
$newurl = "http://".$line3[0];
if (!$list{$newurl}) {
$list{$newurl} = $list{$parent} + 1;
$queue[$counter] = $newurl;
$counter++;
        }
    }
} # end of sub build_url_list
########################### results #########################
sub results {
local(%list) = @_;
# Print Header Info
&html_header("Link-Fetcher Results");
print "<BODY BGCOLOR=\"#FFFFFF\" TEXT=\"#000000\">";
print "<BR><center>";
print "<br><br>";
print "<TABLE COLSPEC=\"L20 L20 L20\" border=6>\n";
print "<CAPTION ALIGN=TOP>Link-Fetcher Results</CAPTION>";
print "<tr>";
print "<th>Number</th><th>URL</th><th>Level</th></tr>\n";
$dummy=1;
foreach $i (sort keys %list) {
print "<TR><td>$dummy</td><TD><a href=\"$i\">$i</A></td><td>$list{$i}</td></tr>\n";
$dummy++;
    }

print "</table><BR><center>";
print "<A href=\"/mgtest/robot.html\">Search again</A><BR><BR>";
print "</body></html>";
}

Warning
Developers of spider programs have to be very careful to avoid falling into local infinite loops. Consider this common scenario. After a base URL is entered, the first link within the base URL is followed and it, in turn, references other links. These sublinks are followed; however, one of them references the base URL. The program then, unless precautions are taken, follows the self-reference back to the base URL, creating the dreaded loop. Study the code in Listing 22.6 carefully to see how to sidestep that danger by accumulating a table, in memory, of links already fetched.

The spider code is assembling the list of hyperlinks and necessarily also is accumulating raw content. The content then can be indexed by a search engine; practical implications of this at the corporate level are discussed in the next section. The more detail-oriented Perl programmer also might notice an important trick: the two innocuous-looking code lines

$/="";
$*=1;

are in fact quite important. Only in this way can the regular end-of-line marker be turned off in a pattern match and the contents be parsed paragraph by paragraph, which is what you want to do when parsing HTML. Why? Because often a hyperlink is broken up across two or more lines. Therefore, you want to do multiline matching spanning the end-of-line marker. The script's most important function is to identify accurately all the hyperlinks contained in a fetched document. Note in passing that you could set up an exclusion table, if desired, to skip certain hyperlinks (those containing a # character, which denotes an intradocument link, or those containing the mailto: tag). If you want to create an exclusion table, my advice is to set it up as an external file and to write a small subroutine to read it in; avoid hard coding the exceptions in the body of the code. Intradocument link URLs also can be handled by stripping off the # character and everything following it to test the possibility that this refers to a new URL.

Indexing a Corporate Intranet: A Case Study

In my recent travails at a large global investment bank, I was charged with the task of creating a site-wide index of all hypertext documents. These documents are housed at a large central server (for business units that do not want to administer their own web server), or they reside in satellite servers scattered across 15 locations globally.

Restating the task, then, it is necessary (in the absence of robust commercial products that can create and read search indexes on a distributed basis) to fetch the documents to the central (fast) server, create a central index, but then tweak the user-query interface. Why this adjustment? Careful consideration of this question sheds light on the core issue: Software right out of the box will create indexes and generate HTML query forms, but the hyperlinks resulting from a query will all point to the machine where the indexes were produced. Clearly, a large firm does not want to burden one machine with the twin tasks of index creation and site-wide document serving. Instead, the hyperlinks that represent a user's query ("Show me all documents with the words 'Institutional' and 'Investor' in them, and rank these documents by a confidence algorithm") must be altered to point back to the machine of origin, the machine that normally houses the document.

I designed a solution using two commercial products: Verity's Topic Internet Server (TIS) and Excite's Excite Web Server (EWS) software. Both products are quite good and get the job done; for the purposes of this book, the EWS product is more thematic because it has a fully open interface written completely in Perl. I discuss in the next section the architecture of the EWS solution.

Building Blocks of an Excite Web Server Intranet Solution

To accomplish the goals set forth in the preceding paragraph, I needed to install the EWS product (which is freely downloadable from http://www.excite.com), unpack and install it on the central server, and then schedule a UNIX cron job to run nightly to fetch firm-wide HTML to the same machine.

Does this sound familiar? The cron fetching program is a simple variant on the robot.cgi program that I already discussed in this chapter. The data is fetched to the central file system, with remote machine domains being mapped to the local document file system. All files coming from the Fixed Income business area might have the string "USFI" somewhere in their Internet domain, for example; I can write these files out to the local file system starting at the base directory /usr/local/etc/HTML-DATA/USFI. The mapping of domain name to directory is accomplished with a simple table lookup. Furthermore, exclusion tables are used to exclude mailto: tags, intradocument links, *.gif, *.tiff, *.jpeg, and so on, hyperlinks in the fetching process.

Suppose that I've got a working version of the fetching program, and nightly I build a local file system of data. Wait! Why do I need to fetch all the remote files every night? That is terribly inefficient, and you've already seen a way to avoid it: the get_head program I discussed at length in Chapter 20. A brief review of the get_head program should turn on the proverbial lightbulb; use of the HEAD method (or, equally well, a conditional GET request) allows the fetching program to test the document-modification date before the actual fetch. If I have a database of document timestamps (and the get_head program shows you precisely how to build it), I'm in business; I need only fetch new documents or those documents whose timestamps have been altered.
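A minimal sketch of the timestamp test follows. It assumes the LWP library's head() routine is available (the chapter's own http_get utility would serve just as well) and that %stamp is a DBM-tied hash of previously seen modification times; both the library choice and the file paths are assumptions for illustration, not the production code:

#!/usr/local/bin/perl
use LWP::Simple qw(head mirror);
dbmopen(%stamp, "/usr/local/etc/HTML-DATA/stamps", 0644);   # hypothetical timestamp database

$url       = "http://www.bigbank.usfi.com/mexico-bond.html";    # example document
$localfile = "/usr/local/docroot/USFI/mexico-bond.html";        # its local copy

# head() issues an HTTP HEAD request and returns, among other things, the
# Last-Modified time of the document without transferring its body.
($type, $length, $mod_time) = head($url);
if (!defined($stamp{$url}) || $mod_time > $stamp{$url}) {
    mirror($url, $localfile);          # fetch only new or changed documents
    $stamp{$url} = $mod_time;
}
dbmclose(%stamp);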

Now that I've saved the firm a lot of network traffic with the HEAD method trickery, I find out that I'm not done yet. Use of the generic EWS query software will, as previously mentioned, generate an inappropriate set of answer hyperlinks that will all point to the central machine. How can I arrange the hyperlinks to point to the document origin machines instead? The answer lies in the creation of a parallel script file system that uses the Location header you first saw in Chapter 20.

Here are the details of the interface design:

While I am creating the document file system (/usr/local/docroot/USFI/...), I also am creating a parallel script file system (/usr/local/etc/httpd/cgi-bin/USFI/...).
Each time I fetch a document to the local file system (/usr/local/docroot/USFI/mexico-bond.html), I write out a corresponding script (/usr/local/etc/httpd/cgi-bin/USFI/mexico-bond.cgi) on the local file system.
The script will use the Location header to direct the request back to the document origin. The mexico-bond.cgi program, for example, is extremely short and might look like this:
#!/usr/local/bin/perl
print "Location: http://www.bigbank.usfi.com/mexico-bond.html\n\n";
exit 0;

The mysterious actions detailed in these notes become clearer when I explain that the EWS search engine uses an open set of Perl interface scripts. It is a simple matter for the developer to intercept user queries and alter the behavior of the EWS output. In my case, all I need to do is change the hyperlink to point to the newly created redirection script rather than the source document. Therefore, the central server just needs to do the work of assembling the list of hyperlinks that show the user which documents matched the query. When the user clicks on one of them, the network and CPU load is transferred to the remote machine where the document is authored.
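Here is a sketch of the script-generation step; the paths are the illustrative ones used in these notes, and the routine is a simplified stand-in for the production code:

# Write a tiny redirect script for one fetched document.
sub write_redirect {
    local($cgipath, $remote_url) = @_;   # e.g., /usr/local/etc/httpd/cgi-bin/USFI/mexico-bond.cgi
    open(CGI, ">$cgipath") || die "cannot create $cgipath: $!";
    print CGI "#!/usr/local/bin/perl\n";
    print CGI "print \"Location: $remote_url\\n\\n\";\n";
    print CGI "exit 0;\n";
    close(CGI);
    chmod(0755, $cgipath);               # the server must be able to execute it
}

&write_redirect("/usr/local/etc/httpd/cgi-bin/USFI/mexico-bond.cgi",
                "http://www.bigbank.usfi.com/mexico-bond.html");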

If the user queries the document archive on the term Mexico and the EWS engine generates a number of hits, for example, it is only necessary for me to change the Perl interface library to transform the hyperlink (a single line of code to substitute one regular expression for another). Now the concept of writing scripts in file system hierarchies that parallel the document hierarchies is sounding more and more clever, or so I tell my developer friends when we go out for Newcastle Nut Brown Ale. A simple change to the hyperlink makes all the difference. Instead of

<a href="http://www.bigbank.central.com/USFI/mexico-bond.html">Mexico Bond Report </a>

for example, the transformation routine deflects requests back to the remote machines by yielding this:

<a href="http://www.bigbank.usfi.com/cgi-bin/USFI/mexico-bond.cgi">Mexico Bond Report </a>

Typically, in Perl there's more than one way to do it. Here, I outlined my solution to the clumsy problem of a monolithic machine bearing much of the site-search burden; there are other ways to approach the problem. An even more radical approach would be to do nothing and wait for effective distributed search agent technology. Because this wasn't an NSF-sponsored project, I am constrained from providing source code, but I think the examples I've discussed in the context of the solution should give you an excellent head start when it's time to solve your own intranet search puzzles.

Excite does have some peculiar features in its current release. If I want to retrieve documents related to my associate Anne Dinning, for example, I cannot force adjacency of Anne and Dinning. Even though I want to see Anne Dinning, Excite will return a superset: documents with the exact phrase Anne Dinning and documents with the terms separated (although, in the latter case, the relevancy score decreases). Other engines, such as Verity, have no trouble with adjacency yet have a less open interface. It's also possible, of course, to implement simple search schemes using products such as Glimpse, which you learn about in the next section. In a corporate environment, the degree of customization offered by the base product often turns out to be a key feature.

Introduction to Glimpse

Glimpse is an interesting and easy-to-use package from the Computer Science Department at the University of Arizona.(See note) There are two components: one administrative (the creation of the indexes) and one end-user oriented (a Glimpse query on a previously built Glimpse index).

Glimpse Indexing

To query a Glimpse index, I first must build it. On a UNIX box, this is as easy as saying

glimpseindex .

to index every file in my current directory, or

glimpseindex ~

to index every file in my home directory. As the Glimpse on-line manual page says, "Glimpse supports three types of indexes: a tiny one (2-3% of the size of all files), a small one (7-9%), and a medium one (20-30%). The larger the index the faster the search." The performance of the indexing engine is reasonable; the authors give a time of 20 minutes to index 100MB from scratch on a Sparc 5. Glimpse has been ported, by the way, to Sun OS, Dec Alpha, Sun Solaris 2.x, HP/UX, IBM AIX, and Linux.

How to vary the index size? It's easiest to refresh my memory by asking for help on the command line or by consulting on-line or printed manual pages. Of course, I also could travel to the Arizona Web page and look, but let's stay local for the time being. I type

glimpseindex -help

and I get this:

This is glimpseindex version 2.1, 1995.
usage: glimpseindex [-help] [-a] [-f] [-i] [-n [#]] [-o] [-s] [-w #] [-F]
[-H dir] [-I] [-S lim] [-V] dirs/files
summary of frequently used options (for a more detailed listing see "man glimpse"):
    help: outputs this menu
    a: add given files/dirs to an existing index
    b: build a (large) byte level index to speed up search
    f: use modification dates to do fast indexing
    n #: index numbers; warn if file adds > #% numeric words: default is 50
    o: optimize for speed by building a larger index
    w #: warn if a file adds > # words to the index
    F: expect filenames on stdin (useful for pipelining)
    H 'dir': .glimpse-files should be in directory 'dir': default is '~'

Immediately, I see the version number (and I can search the Internet if I suspect that I am out of date). I find that the command options -o and -b both look interesting. For example,

glimpseindex -o .

would create a Glimpse index that is larger than the default index size on files in my current directory. How much larger? The man pages tell me that the default index is "tiny." I must dedicate disk space for the index that is 2-3 percent of the size of the file(s) to be indexed. The -o option creates a "small" index that is 7-8 percent as big as the file(s), and -b creates the "medium" index that is about 20-30 percent of the file(s). As expected, the trade-off is between disk space and query execution time; the bigger an index I build in the indexing step, the faster the users can search on my index files. If I use the -f flag

glimpseindex -f .

this is fast indexing; Glimpse checks file-modification dates and adds only modified files to the index. The authors report an indexing time of about five minutes on a 100MB data file, using a Sparc 5 with the -f option.

Tip
Don't forget to read the man pages! They are an invaluable reference guide. It's a good idea on a UNIX box to issue man -t [topic] to get a hard copy of the man pages.

A Practical Test of Glimpse

I indexed three weeks' worth of the May 1995 NYU EDGAR Server access log (a 3.0MB file), and an index file of 36,122 bytes was created-indeed, a "tiny" index of only about 1.2 percent of the original file.

Setting Up a Glimpse Query

If I type

glimpse

on the command line, I get this:

This is glimpse version 2.1, 1995.
usage:   [-#abcdehiklnprstwxyBCDGIMSVW] [-F pat] [-H dir] [-J host]
[-K port] [-L num] [-R lim] [-T dir] pattern [files]
summary of frequently used options:
(For a more detailed listing see 'man glimpse'.)
    #: find matches with at most # errors
    c: output the number of matched records
    d: define record delimiter
    h: do not output file names
    i: case-insensitive search, e.g., 'a' = 'A'
    l: output the names of files that contain a match
    n: output record prefixed by record number
    w: pattern has to match as a word, e.g., 'win' will not match 'wind'
    B: best match mode. find the closest matches to the pattern
    F 'pat': 'pat' is used to match against file names
    G: output the (whole) files that contain a match
    H 'dir': the glimpse index is located in directory 'dir'
    L 'num': limit the output to 'num' records only

For questions about glimpse, please contact 'glimpse@cs.arizona.edu'

Note the -#: find matches with at most # errors option. This is a very powerful feature of a Glimpse query; the user can define an ad-hoc error-tolerance level.

In conformance with my integration advice, I become familiar with the Glimpse query's behavior on the command line by typing

glimpse -1 interacess

I introduce a typo ("interacess" instead of the correct "interaccess") but set the error tolerance to 1. The answer comes back:

Your query may search about 100% of the total space! Continue? (y/n)

I type y and the query completes:

/web/research/logs/glimpse_logs/access_log: nb-dyna93.interaccess.com - -
[30/May/1995:14:54:47 -0400] "GET /examples/mas_10k.html HTTP/1.0" 200 45020
/web/research/logs/glimpse_logs/access_log: nb-dyna93.interaccess.com - -
[30/May/1995:14:54:50 -0400] "GET /icons/back.gif HTTP/1.0" 200 354
/web/research/logs/glimpse_logs/access_log: nwchi-d138.net.interaccess.com
- - [02/Jun/1995:14:38:14 -0400] "GET /mutual.html HTTP/1.0" 304 0
/web/research/logs/glimpse_logs/access_log: nwchi-d138.net.interaccess.com
- - [02/Jun/1995:14:38:20 -0400] "GET /icons/orangeball.gif HTTP/1.0" 304 0
/web/research/logs/glimpse_logs/access_log: nwchi-d116.net.interaccess.com
- - [02/Jun/1995:15:32:20 -0400] "GET /SIC.html HTTP/1.0" 200 917
/web/research/logs/glimpse_logs/access_log: nwchi-d116.net.interaccess.com
- - [02/Jun/1995:15:32:22 -0400] "GET /icons/back.gif HTTP/1.0" 200 354

Just the garden-variety NCSA HTTP server log entries matching "interacess" with an error tolerance of 1.

Caution
Pay special attention to software packages that ask command-line questions interposed between a query and an answer. They might require special handling in a Web-integration effort. The Glimpse query engine asked such a question:
Your query may search about 100% of the total space! Continue? (y/n)
If it's not possible to disable this behavior (that is, to run in "silent mode"), the Web gateway program has to supply the "y" ahead of time.

Building a Web Interface to Glimpse

If I somehow were unable to search the Internet for Glimpse integration tools, I could construct with relative ease a Perl gateway to the package. Keeping in mind the guideline to be familiar with command-line behavior of the program, I already have run a Glimpse query and studied its behavior, shown previously.

I build a simple HTML form with the code shown in Listing 22.7 to interface to the Glimpse package, as shown in Figure 22.10.

Figure 22.10 : The user can enter Glimpse query terms and specify an error-tolerance level.


Listing 22.7. The Glimpse query front end.
<H1>Glimpse Interface</H1>
<FORM METHOD="POST" ACTION="http://edgar.stern.nyu.edu/cgi-bin/glimpse.pl">
This form will search the access logs using a specified error tolerance limit.
Enter the company or university:
<INPUT NAME="company">
<P>
Select the error level to allow;
<SELECT NAME="error">
<OPTION SELECTED> 0
<OPTION> 1
<OPTION> 2
<OPTION> 3
<OPTION> 4
</SELECT> <p>

Set maximum hits:
<SELECT NAME="max">
<OPTION SELECTED> 1000
<OPTION> 500
<OPTION> 250
<OPTION> 100
<OPTION> 50
<OPTION> 10
</SELECT> <p>


To submit your choices, press this button: <INPUT TYPE="submit"
VALUE="Run glimpse">. <P>
To reset the form, press this button: <INPUT TYPE="reset" VALUE="Reset">.
</FORM>
<HR> <P>
<A HREF="http://edgar.stern.nyu.edu/EDGAR.html">
<img src="http://edgar.stern.nyu.edu/icons/back.gif">
Return to our Home Page.</A>

Note the flexible feature of a maximum hit cutoff.

Tip
A maximum hit cutoff is a very good idea to implement in any situation where there is the possibility of a mammoth number of records being returned. In this way, the developer can nimbly sidestep complaints that the server is hanging when in fact the volume of the answer is the reason for the slow response. Another good idea is to set the special Perl variable $| = 1 (unbuffered I/O) so that the result screen can be built line by line, providing partial results to the user (which might be enough to answer the query) as fast as possible.

Without undue delay, the Glimpse query completes and returns the answer shown in Figure 22.11.

Figure 22.11 : The Glimpse query results for "Interaccess" server access with an error tolerance of 1.

Listing 22.8 shows the glimpse.pl gateway code.


Listing 22.8. The glimpse.pl Glimpse gateway.
#!/usr/local/bin/perl
#
# glimpse.pl
#
# simple interface to glimpse
#
# Mark Ginsburg  5/95
#################################################################
$gpath = '/usr/local/bin/';   # Where Glimpse lives.
$gdir  = '/';   # Glimpse indexes live in root because I ran glimpseindex
                # as root.

require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl'; # use html_header
require 'edgarlib';
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
# Split the name-value pairs
@pairs = split(/&/, $buffer);

&html_header("glimpse output");
#
#  Format the report line.
#

format STDOUT =
@<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<
$clientsite $cdate
.

#
#  form an associative array from the Form inputs.
#

foreach (@pairs) {
    ($key,$value) = split(/=/,$_);
    $value = &deweb($value);      # clean up the value.
    $form{$key} = $value;
}

$err = " -".$form{'error'};   # form the error-level flag that Glimpse uses
#
# must pipe in a 'y' to the "Continue"? Question or else it hangs.
#

$pipeans = " < /web/fluff/yes ";
$gq = $gpath."glimpse"." -H ".$gdir.$err." ".$form{'company'}.$pipeans;
@gans = `$gq`;   # glimpse does the work and the @gans array has the answer.
print "<TITLE>Glimpse search </TITLE>\n";
print "<h1> Glimpse Access Log Report </h1>";
print "<h2> Company or University: $form{'company'} (errorlevel
$form{'error'}) </h2>";

$hit = 0;
print " <pre> ";
print " \n";
print "<b>";
$clientsite = "Client Site";
$cdate = "Date ";
write STDOUT;           # Write headers

print " </b> \n\n";     # Skip two lines - ready for data

#
#  Now some cosmetics to get the clientsite and the date out of the
#  httpd server access log.
#

foreach $elem (@gans) {
    ($garbage,$prelimsite,$garbage) = split(/:/,$elem);
    ($clientsite,$garbage) = split(/\s-\s/,$prelimsite);
    ($garbage,$prelimdate) = split(/\[/,$prelimsite);
    ($cdate,$garbage) = split(/:/,$prelimdate);
    write STDOUT;
    $hit++;
    if ($hit > $form{'max'}) {      # exit if max hits reached
        print "\n";
        print "user limit of $form{'max'} reached - exiting \n";
        &home;   # make use of a global subroutine
        exit 1;
    }
}

print " </pre> ";
print "<h2> Total of $hit hits </h2>";
&home;   # &home is a subroutine local to the Edgar site
exit 0;

Code Walkthrough: glimpse.pl

I take the user's form input and create an associative array. I clean ("dewebify") the input. The action centers around assembling the Glimpse query.

Note
Be especially careful when assembling a query that will be fed to a software package. For debugging, echo the assembled query on stdout and run it on the command line.

After the query is assembled, I set an array @gans to be equal to the results of the Glimpse query, which is evaluated in back quotes. A little bit of convoluted cosmetic work later, I have presentable HTMLized output. I use the Perl format STDOUT statement to line up the fields.

Some other important points about the glimpse.pl code follow:

The &deweb subroutine is a little piece of code kept in the global subroutine directory /usr/local/lib/perl; it translates the hexadecimal escape codes in form input back to their ASCII equivalents.
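The exact deweb code is site specific, but the job it does is the standard CGI unescaping step. The following is a minimal sketch of such a subroutine, assuming the usual URL-encoding conventions (plus signs for spaces, %XX hexadecimal escapes); only the name &deweb is borrowed from the Edgar library:

sub deweb {
    local($text) = @_;                        # the raw form value
    $text =~ tr/+/ /;                         # '+' encodes a space
    $text =~ s/%(..)/pack("C", hex($1))/ge;   # %XX encodes one character
    return $text;
}

For example, &deweb("AT%26T+Corp") yields "AT&T Corp".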

&home is a subroutine encountered in Chapter 21 to write a handy go-home text link and graphic.

It is crucial to anticipate the command-line question that Glimpse asks. I pipe in the answer "y"; if I omit it, the gateway script will hang forever waiting for the all-important "y."

To reinforce this point,

glimpse -1 foobar

hangs in this environment. I need to say

glimpse -1 foobar < /path/yes-file

where yes-file contains the single character "y."
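If keeping a one-character yes-file around seems fragile, the gateway can let the shell supply the confirmation instead. Here is a minimal sketch, assuming only that Glimpse reads its y/n answer from standard input; the echo pipeline replaces the yes-file but is otherwise equivalent to the approach in Listing 22.8 (the error level is hardcoded to 1 for brevity):

$gq = "/usr/local/bin/glimpse -H / -1 " . $form{'company'};
open(GLIMPSE, "echo y | $gq |") || die "cannot run glimpse: $!";
@gans = <GLIMPSE>;        # capture every matching log line
close(GLIMPSE);

The open call runs the pipeline through the shell, so the confirming "y" is already waiting on Glimpse's standard input when the question is asked.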

Tip
If a script calling an external software package mysteriously hangs but executes quickly on the command line, consider this question: Is the expected answer the only thing returned by the package? And if the script's output looks strange after Perl processes it, is it possible that the package is outputting unprintable characters that might be fouling up the works? To answer the latter question, run the query on the command line and redirect the standard output to a file. Then use an editor (for example, Emacs in hexadecimal mode) to scan the output for unusual ASCII codes.
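A tiny throwaway filter can do the same job as the hexadecimal editor. The following sketch (the script name is made up) reads the captured output on standard input and reports any byte outside printable ASCII:

#!/usr/local/bin/perl
# suspectbytes.pl (hypothetical name) - usage: suspectbytes.pl < captured-output
$offset = 0;
while (defined($ch = getc(STDIN))) {
    $code = ord($ch);
    if (($code < 32 && $ch ne "\n" && $ch ne "\t" && $ch ne "\r") || $code > 126) {
        printf "suspect byte %d (decimal) at offset %d\n", $code, $offset;
    }
    $offset++;
}

Anything other than silence points to bytes worth investigating before the package output is handed to Perl.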

The cosmetic section presupposes that the end user cares only about the client site and the date of access; all other fields of the server access log are dropped from the report.

Having done all this preliminary interface work, I can almost throw it all away! Why? Because an HTTP-Glimpse gateway has been prebuilt for the Internet community by Paul Klark.(See note)

This flexible software allows browsing (to find directories where useful information might reside) integrated with Glimpse querying. The matches are hyperlinks to the underlying files, just as WAIS-HTTP gateways provide. As Paul Klark writes, "Following the hyperlink leads you not only to a particular file, but also to the exact place where the match occurred. Hyperlinks in the documents are converted on the fly to actual hyperlinks, which you can follow immediately."

I cannot stress this lesson enough: the developer should look around before coding! Web gateway programming is the problem of using dozens of promising building blocks, all interrelated in a dense and tangled mesh, to best advantage. The chances are very good that someone already has started the project that a new Web developer is undertaking or, at the very least, has constructed something similar enough to be quite useful.

Pros and Cons of Glimpse

Assuming that a site has a data store it would like to index and query, I recommend Glimpse when a site has disk storage constraints, when it does not need the relevancy scores of WAIS, or when empirical tests show that Glimpse's query speed and accuracy are acceptably close to WAIS. Its ease of use is a definite plus, and a well-established research team is pushing it forward.

The authors mention a few weaknesses in the current version of Glimpse.(See note) I will mention two here:

Because Glimpse's index is word based, it can search for combinations only by splitting the phrase into its individual words and then taking an additional step to form the phrase. If a document contains many occurrences of the word last and the word stand but very few occurrences of the phrase last stand, the algorithm will be slow.

The -f fast-indexing flag does not work with -b medium indexes. The authors note that this is scheduled to be fixed in the next release.

The Glimpse team is to be commended for its excellent on-line reference material, identification of known weaknesses and bugs, porting initiatives, and well-conceived demonstration pages.

Harvest

Harvest, a research project headed by Michael Schwartz at the University of Colorado (the team also includes Udi Manber of Glimpse fame), addresses the very practical problem of reducing the network load caused by the high traffic of client information requests and reducing the machine load placed on information servers.

Harvest is a highly modular and scalable toolkit that places a premium on acquiring indexing information efficiently and replicating that information across the Internet. No longer is the machine with a popular information store cursed to bear the burden of answering thousands of text-retrieval requests daily; with Harvest, one site's content can be efficiently represented and replicated.

The first piece of the Harvest software is the Gatherer. The Gatherer can be run on the information provider's machine, thus avoiding network load, or it can run remotely, accessing the provider over FTP or HTTP. Its function is to collect indexing information from a site. The Gatherer takes advantage of highly customizable extraction software known as Essence, which can unpack archived files, such as tar (tape archive) files, or find author and title lines in LaTeX documents. Because Essence is easily tailored at the information site, it builds a high-quality index for outbound distribution.

The second piece is the Broker. The Gatherer communicates to the Broker using a flexible protocol that is a stream of attribute/value pairs. Brokers provide the actual query interface and can accommodate incremental indexing of the information provided by Gatherers.
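To make the attribute/value idea concrete, here is a rough sketch of how such a stream could be consumed on the Broker side; the record layout and the attribute names (URL, Title) are purely illustrative and are not the exact attribute set Harvest defines:

#!/usr/local/bin/perl
# Read blank-line-separated records of "Attribute: value" lines and
# collect each record into an associative array (illustrative only).
while (<STDIN>) {
    chop;
    if (/^\s*$/) {                          # a blank line ends one record
        print "received object: $record{'URL'} ($record{'Title'})\n"
            if $record{'URL'};
        %record = ();
        next;
    }
    ($attr, $value) = split(/:\s*/, $_, 2); # first colon separates name and value
    $record{$attr} = $value;
}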

The power of the Gatherer-Broker system is in its use of the distributed nature of the Internet. Not only can one Gatherer feed many Brokers across the Net, but Brokers also can feed their current index to other brokers. Because distributed Brokers may possess different query interfaces, the differences may be used to filter the information stream. Harvest provides a registry system, the Harvest Server Registry (HSR), which maintains information on Gatherers and Brokers. A new information store should consult the registry to avoid reinventing the wheel with its proposed index, and an information requester should consult the registry to locate the most proximate Brokers to cut down on search time.

After the user enters a query to a Harvest Broker, a search engine takes over. The Broker does not require a specific search engine-it might be WAIS, freeWAIS, Glimpse, or others. Glimpse is distributed with the Harvest source and has, as already mentioned, very compact indexes.

Another critical piece of the Harvest system is the Replicator. This subsystem is rather complicated, but in essence daemons oversee communication between Brokers (which are spread across the Internet on a global scale) and determine the extent, timing, and flow of replicated information. The upshot is that each replication group floods object information to the other members of the same group and then between groups. Thus, a high degree of replication is achieved between neighbors in the conceptual wide-area map, with convergence toward high replication among less proximate Brokers over time.

Any information site can acquire the Harvest software, run Gatherer to acquire indexing information, and then make itself known to the Harvest registry. Web developers who want to reduce load on a popular information store are strongly advised to do more research on Harvest and its components.

Figure 22.12 shows the Internet Multicasting Service's EDGAR input screen for a Harvest query. The user does not need to know which retrieval engine is bundled with the Harvest software; it might be WAIS or it might be Glimpse. Because Harvest is highly modular, it is easy to swap indexing and retrieval engines. Observe the similarities to a WAIS screen.

Figure 22.12 : A Harvest query is started at the Internet Multicasting Service's EDGAR site.

Figure 22.13 shows the response. Remember that the search engine chosen is up to the Broker.

Figure 22.13 : The response from a Harvest query with options to see the object methods.

In summary, Harvest is another example of an excellent research team providing a fascinating new tool to accommodate efficient, distributed text search and retrieval. Because every subsystem (Gatherer, Broker, and Searcher) is highly customizable, and Harvest automatically handles the replication of Broker information, the web developer should keep a close eye on the Colorado team as further developments unfold. The most recent turn of events is a commercial spin-off of Harvest, at http://www.netcache.com/, to market a highly optimized HTTP proxy caching server. The research wing of object caching development continues in parallel at http://www.nlanr.net/Squid/.

Text Search and Retrieval Tools Check


Footnotes

"The Z39.50 Protocol in Plain English," by Clifford A. Lynch, is available at http://ds.internic.net/z3950/pe-doc.txt.
http://cnidr.org/welcome.html is the home page of the Center for Networked Information Discovery and Retrieval.
http://vinca.cnidr.org/software/zdist/zdist.html has pointers to a mailing list and source code and summarizes the components of ZDist.
http://www.wais.com/ is the corporate home page of Brewster Kahle's WAIS, Inc.
Tim Mann, the wizard of xboard, can be found on-line at the Internet Chess Club (ICC)-telnet chess.lm.com 5000 and then type finger mann. There is now winboard, a port of xboard to 32-bit MS-Windows systems.
http://charly.informatik.uni-dortmund.de/freeWAIS-sf/ has more information on freeWAIS-sf's features and history, and the source distribution is available there as well. Ulrich Pfeifer's home page is http://charly.informatik.uni-dortmund.de/~pfeifer/.
The Glimpse home page is at http://glimpse.cs.arizona.edu/ and the manual pages are on-line at http://glimpse.cs.arizona.edu/glimpsehelp.html.
Paul Klark's glimpseHTTP software distribution, currently at version 1.4, can be fetched from ftp://cs.arizona.edu/glimpse/glimpseHTTP.1.4.src.tar.
Glimpse's current limitations are on-line at http://glimpse.cs.arizona.edu/glimpsehelp.html#sect14.