Chapter 33

A Hypertext News Interface


In this chapter, I describe my Web interface to Usenet news archives, which is known as HURL: The Hypertext Usenet Reader and Linker. I show samples of the interface, discuss some of the decisions that I made, and explain how I implemented the interface.

Problem Definition

In this section, I explain the project's history and discuss some of the objectives that the interface was intended to accomplish.

Project History

This project started as an attempt to make a large archive of articles from the Usenet newsgroup talk.bizarre accessible to people using a mail server, FTP, and the World Wide Web. My work on the Web version of this project was my first exposure to the Web. I've been hooked ever since!

In the early days of this project, I came to realize that there were huge collections of news archives scattered across the Internet, but most of them were virtually inaccessible for lack of a good way to search and browse them. Many of these archives, such as those of rec.food.recipes, rec.arts.movies.reviews, or comp.lang.perl, contain genuinely useful information and would be far more useful with a friendlier interface than FTP or Gopher.

With this in mind, I decided to generalize the talk.bizarre project into an interface to any Usenet news archive. Also, because Internet mail messages have a similar format to Usenet articles, I recently have added support for mailing list archives to HURL.

Cameron Laird maintains a comprehensive list of all Usenet news archives at http://starbase.neosoft.com/~claird/news.lists/newsgroup_archives.html.

Design Constraints

Early in the talk.bizarre project, I had to choose between two storage strategies: converting each article to HTML once and storing that version in place of the original article file (storing both versions would have doubled the size of the archive), or keeping each article in its original format (the standard news article format defined in RFC 1036, with a single article in each file) and converting it to HTML on-the-fly with a CGI script.

I decided to do the latter, mainly because

I wanted the archive to be accessible to people using FTP as well as the Web, which requires that a plain text version be available.
Serving the articles through a CGI script allows for an extra element of interactivity, making it possible to customize the links on each article and the interface, depending on the current user's needs.
It just made sense to keep a version in the original format!

This method also has some drawbacks, however: serving each article through a CGI script increases the load on the server machine, and because the generated page can differ slightly from one request to the next, caching proxy servers get no benefit from keeping local copies of articles.

Another early design decision I made was that the main way "in" to the archive would be to enter a search to find a specific set of articles. Some other similar projects, such as the popular hypermail interface to mail archives, require the user to browse through messages by first selecting a time slice of the archive (such as a quarter of a year), and then viewing the subjects and authors of all messages sent during that time period. Due to the tremendous size of the talk.bizarre archive (more than 150,000 articles, or some 300MB of text), this type of browsing is not practical.

Although this might seem limiting to someone who just wants to browse through an archive (which might be likely for smaller archives), the HURL administrator can create some other entry points to the archive by creating a set of predefined queries that can be directly linked to from the archive's main page. (See Fig. 33.4 later in this chapter for an example.)

A final design constraint that I observed early on was to make HURL as widely installable as possible. Any software that HURL relied on had to be freely distributable and couldn't rely on some esoteric feature that isn't found on most UNIX systems.

The Implementation Process

One of the most rewarding aspects of developing projects on the World Wide Web (and the Internet in general) is the availability of people willing to participate as beta testers and to give you feedback on your project before it is completed.

When developing this project, one of the first things I did was put up a prototype interface, which allowed others and me to experiment and discuss what features we thought would be useful in such an interface. After I had this prototype working, I updated the interface and released new versions based on the feedback from these early users.

This method seemed to work extremely well; within hours of announcing a new version of the interface, I would have suggestions from users in my mailbox and could start working on incorporating their ideas into HURL.

An Overview of the Interface

In this section, I discuss each component of the HURL interface and describe how these components relate to one another.

The Query Page

As I mentioned earlier, the main entry point to a HURL archive is the query page, where the user enters a search for text contained in certain article header elements such as the Subject, Keywords, or From lines. (The specific article headers that can be searched depend on how HURL was configured by the administrator.) See Figure 33.1 for an example of the query page.

Figure 33.1: The query page.

I had a hard time designing this HTML form to be powerful enough to allow queries of reasonable complexity, yet simple enough to look clean and not confuse a novice user. It was tempting to allow arbitrary logical combinations and to give each header field its own set of check boxes instead of applying one set to the query as a whole, but that would have made the form too complicated.

This form is submitted to a CGI script, which interprets the values given in the form, performs the specified search, and returns a list of messages that match the user's query. (I discuss the details of the implementation of this CGI script in a later section.)

The Message List Browser

A query result returns a list of messages that match the specified search criteria. Because these lists often can be quite long, they are split into separate pages with links at the top and bottom of each page to scroll through the list.

For each message in the list, a single line is displayed listing the date, author, and subject of the article, with a link from the subject to retrieve the article itself. The table is aligned using <PRE>-formatted text.
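
Keeping the columns aligned inside <PRE> text is mostly a matter of fixed-width formatting. A minimal sketch in Perl follows; the field widths and URL here are illustrative, not HURL's actual values:

# Print one line of the message list inside a <PRE> block, with the
# subject linked to the article itself.  The field widths here are
# illustrative, and a real script would also HTML-escape the text.
sub list_line {
    my( $date, $author, $subject, $url ) = @_;
    printf( qq{%-12.12s %-22.22s <A HREF="%s">%s</A>\n},
            $date, $author, $url, $subject );
}

list_line( '12 Mar 1995', 'someone@example.com',
           'A sample subject line', '/bin/message-ID?someone@example.com' );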

I called this part of the HURL interface the Message List browser because I wanted to allow it to be used not only for query results, but for any arbitrary list of messages. Users could create a list of their favorite messages from a newsgroup, for example, and then use HURL's message list browser to browse through and view those messages. Figure 33.2 shows a sample of the Message List browser.

Figure 33.2: The Message List browser.

The Article Page

Selecting an article from a message list produces an article page for that article, complete with hypertext links to other articles that are related to it in some way. In this section, I discuss some of the various components of the article page.

Overall Structure

At the top of each article page is a navigation bar, followed by the article header (slightly reformatted), message body, and a footer that identifies the archive and maintainer. If the message body appears to contain quoted text from another message, the quoted text is italicized to make it stand out from the original text in the article.

I decided to display the article pages in <PRE>-formatted text in order to be consistent with the article's original format as posted to Usenet: plain monospaced text. Some people would argue that this makes the text ugly and difficult to read, but in general it is impossible to decide automatically whether a Usenet article can be safely reflowed, with the possible exception of newsgroups such as rec.arts.movies.reviews that have a predictable paragraph structure. Figure 33.3 shows a sample article page.

Figure 33.3: An article page.

Icon Navigation Bar

At the top of each article page, there is a row of buttons with links to the next or previous article by date, author, or in the currently selected list of articles (which is typically a query result). If one of these functions is unavailable for the current article (for example, if there is no next article by the same author, or if you're viewing the last article in a list), the icon is dimmed and doesn't receive a link. This prevents the user from following an invalid link and getting an annoying error message such as No next article by this author.

Other icons at the top of each article link to additional pages, including the article filters page described later in this chapter.

Article Header Lines

The article header for Usenet archives is displayed in the following way:
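
(The exact layout depends on how the archive was configured; the following is a representative sample rather than HURL's actual output.)

From: J. Random Poster <jrp@example.com>
Newsgroups: talk.bizarre
Subject: A sample subject line
Date: 12 Mar 1995 04:56:21 GMT
Message-ID: <12345@example.com>
References: <12344@example.com>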

For mailing list archives, the header appears in a similar manner, except the In-Reply-To header line acts as the References line, and To and Cc lines are used in place of the Newsgroups line.

Links within Articles

Any e-mail addresses or message-ID references that appear within the body of an article also get links to the author page for that person and the article referenced, respectively. These links appear only if the author or article currently exists in the archive; this was somewhat difficult to implement, and is one of HURL's strong points over similar interfaces. (How this was done is discussed a bit later!) HURL also places links on any URLs that it sees within articles, although it doesn't try to verify whether they actually work.
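
The underlying idea is a conditional substitution: a candidate reference becomes a link only if it is a key in the database of known message IDs. A minimal sketch, assuming a DBM file named msgids and a hypothetical URL scheme:

# Link message-ID references in an article body, but only when the
# referenced article actually exists in the archive.  %known is the
# DBM table of valid message IDs created during the build process.
dbmopen( %known, "msgids", undef ) or die "can't open msgids: $!";
open( ART, "article.txt" ) or die "can't open article: $!";
while( my $line = <ART> ) {
    $line =~ s{<([^<>\s]+\@[^<>\s]+)>}
              { exists $known{$1}
                  ? qq{&lt;<A HREF="/bin/message-ID?$1">$1</A>&gt;}
                  : "&lt;$1&gt;" }ge;
    print $line;
}
close( ART );
dbmclose( %known );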

A Sample Archive's Home Page

Although the query page is the main entry point to a HURL archive, the administrator can make a newsgroup-specific home page that's customized to the needs of the newsgroup in question. One way this can be done is by creating predefined queries using hypertext links with the query information encoded in the URLs. Figure 33.4 shows a sample of what such a page could look like for an archive of the comp.infosystems.www.announce newsgroup.

Figure 33.4: A sample archive's home page.

The Implementation

In this section, I discuss the implementation behind the interface: the behind-the-scenes magic that makes this interface work, including the archive build process and the CGI scripts that tie everything together.

HURL is implemented entirely in Perl, with the exception of a single routine written in C that will be replaced with a Perl version in a future release. I found Perl to be ideal for this project, due to its excellent text-processing features and built-in support for the UNIX DBM database format. Although I hadn't used Perl before this project, I found it easy to learn because it's largely based on other UNIX tools that already were familiar to me.

The Build Process

In order for the Web interface to be fast enough to be usable, some information about each article is precalculated and stored in a database. The builds can happen as frequently or as infrequently as desired by the system administrator; any new articles that are added to the archive show up in the Web interface when the indexes are rebuilt. A typical configuration would have the indexes updated automatically at night when the machines are less busy.
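
For example, a nightly rebuild can be scheduled with a crontab entry along these lines (the installation path is hypothetical):

# Rebuild the HURL indexes every night at 3:00 a.m.
0 3 * * * /usr/local/hurl/bin/build-index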

The build follows approximately these steps:

  1. An initial run is made through all the articles in the archive, generating indexes for later use in the build process: lists of all valid message IDs and authors in the archive, and tables correlating message IDs with authors, dates, and file names (with the dates normalized to a common integer format rather than the wildly varying formats found in news articles). A minimal sketch of this pass appears after this list.
  2. The tables created in the first step are sorted by author and date, and then are used to create next and previous links by author and date in the database. These tables also are used to create a database containing information and statistics about the authors, such as the number of articles they have posted and the dates of their first and last posts to that newsgroup.
  3. Another run is made through the articles, looking for e-mail addresses and message-ID references within each article. Each candidate reference is checked for validity (that is, whether it points to something that actually exists in the archive) and, if valid, is stored in the database.
  4. A final pass through the articles is done, generating indexes of the headers to be used for the query script. These indexes currently are just plain text files with a single line for each article in the archive, and with a separate file for each header element that has been configured as being searchable by the HURL administrator.
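
Here is the promised sketch of the first pass, assuming one article per file (per RFC 1036) and a simple tab-delimited output table; the real build also normalizes dates and writes several other indexes:

# First build pass (sketch): record the message ID, author, date,
# and file name of every article given on the command line.
open( TABLE, ">msgid.table" ) or die "can't write msgid.table: $!";
foreach my $file ( @ARGV ) {
    open( ART, $file ) or next;
    my( $id, $from, $date );
    while( <ART> ) {
        last if /^\s*$/;                     # blank line ends the header
        $id   = $1 if /^Message-ID:\s*<(.+)>/i;
        $from = $1 if /^From:\s*(.+?)\s*$/i;
        $date = $1 if /^Date:\s*(.+?)\s*$/i;
    }
    close( ART );
    print TABLE join( "\t", $id, $from, $date, $file ), "\n" if $id;
}
close( TABLE );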

The Database Format

Almost all of the precalculated data is stored in a database format that's standard on all UNIX systems, called DBM. DBM databases are simply collections of key/value pairs of data, stored using a hash table. Perl makes it extremely easy to use DBM databases, because it allows for DBM files to be bound to associative arrays.

This means that after you bind a DBM file to an associative array with the dbmopen function call, you can store and retrieve values in the database with any of the operations you normally would use on associative arrays. For example,

dbmopen( %DBFILE, "dbfile", 0600 );   # bind the hash %DBFILE to the DBM file "dbfile"
$DBFILE{'foo1'} = "bar1";
$DBFILE{'foo2'} = "bar2";
dbmclose( %DBFILE );

This creates a DBM database containing two keys, "foo1" and "foo2", with values "bar1" and "bar2", respectively. These values can be retrieved later in the same manner:

dbmopen( %DBFILE, "dbfile", 0600 );
print "The value stored in key 'foo2' is: $DBFILE{'foo2'}\n";
print "The value stored in key 'foo1' is: $DBFILE{'foo1'}\n";
dbmclose( %DBFILE );

Because DBM files are implemented as hash tables, retrieval of these values is very fast, even for large databases.

For HURL, the keys that I used were the message IDs for each article, which uniquely identify articles in the archive. The value stored for each message-ID key is the information that I calculated for each article during the build process: the links that belong on each of the icons in the navigation bar and the valid message-ID and e-mail address references within the article body.

Because all this information is precalculated and stored in a way that makes retrieval efficient, the CGI script to output an article in the HTML format is extremely fast; it just has to retrieve a single value from the DBM database using the (known) message ID as the key and then output the article along with the appropriate hypertext links. Because the article page is the most often requested item in the interface, I decided that it would be good to make this part of HURL as efficient as possible in order to decrease the load on the Web server.
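
A minimal sketch of that path, assuming the stored value is a tab-delimited string that begins with the article's file name (the real record format is internal to HURL):

# Article script (sketch): one DBM lookup by message ID, then output
# the article.  The real script inserts the precalculated links and
# HTML-escapes the text as it copies it out.
my( $msgid ) = $ENV{QUERY_STRING} =~ /^([^&]+)/;
dbmopen( %article, "articles", undef ) or die "can't open database: $!";
my( $file, @links ) = split( /\t/, $article{$msgid} );
dbmclose( %article );

print "Content-type: text/html\n\n";
print "<PRE>\n";
open( ART, $file ) or die "can't open $file: $!";
print while <ART>;
close( ART );
print "</PRE>\n";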

Executing Queries

Queries are performed by taking the input entered by the user on the query page, processing it slightly, and then opening a pipe to an external command to search for the text with the specified options.

The external command used is an extremely fast variant of the UNIX grep program called agrep, written by Udi Manber and Burra Gopal of the University of Arizona, and Sun Wu of the National Chung-Cheng University, Taiwan. agrep is the basis for the file system indexing tool called Glimpse; both of these are available with source code from http://glimpse.cs.arizona.edu:1994/.

Using the agrep program against flat text files isn't the most sophisticated method of performing large-scale text searches, but the approach has real advantages: it is simple, the per-header index files produced by the build double as the search data, and agrep itself is extremely fast.

This method already is fast enough to be quite useful, but it eventually might be replaced with a different indexing scheme based on a more powerful database. I kept this in mind when writing the current query script and made it somewhat modular so that another query system could be written without having to redo everything else. (In fact, two query systems could coexist quite nicely, providing complementary features.)

The query script finds lists of articles that match the search criteria specified, combines these results, and creates a file on the server with a list of message IDs and file names. This list is in turn used by the Message List Browser script, which outputs a table of the messages found with links to the messages themselves.
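
A minimal sketch of the search step, assuming a per-header index file whose matching lines already carry the message ID and file name (the index layout and result directory here are hypothetical):

# Query script (sketch): run agrep over the Subject index and save
# the matching lines under a random, hard-to-guess list name.
my $subject = "something interesting";       # the decoded form field
$subject =~ s/([^\w\s])/\\$1/g;              # neutralize regexp metacharacters

my $list = join( "", map { chr( ord("a") + int rand 26 ) } 1 .. 9 );
open( RESULTS, ">/tmp/hurl-$list" ) or die "can't save results: $!";
open( AGREP, "-|" ) or exec "agrep", "-i", $subject, "subject.index";
print RESULTS $_ while <AGREP>;              # each line: header, msg ID, file
close( AGREP );
close( RESULTS );

The nine random letters generated here play the same role as the jiagvyfcn identifier discussed in the next section.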

Maintaining State between Pages

Because the current version of HTTP is stateless, some extra work needs to be done to pass information from one CGI script to the next. I accomplished this by passing extra parameters along with each CGI request and writing those parameters into any new pages my scripts create. (This is a common method of passing state information.)

To explain this better, it might help to look at a sample URL:

http://servername/bin/browse?jiagvyfcn&pos=101

This URL is one that could be given to the Message List Browser script, and it says that I want to browse the message list identified by jiagvyfcn, starting at position 101. The word jiagvyfcn looks like nonsense, but it's just a random collection of letters that was generated by the query script so that it would have some way to refer to the newly created list of messages. It also happens to be the name of the file stored on the server that contains the list of message IDs and file names that were found by the query script.

When the Message List Browser script is called with a URL like this one, it parses the text after the question mark (which is found in the QUERY_STRING environment variable), and then opens the specified file on the server and skips the first 100 messages on the list because the URL said that I want to start with the 101st message. It then displays a page of the next 100 messages, along with links to the next and previous pages.
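
A minimal sketch of that logic, assuming the saved lists live in a known directory and pages of 100 messages (the file locations are hypothetical):

# Message List Browser (sketch): reopen the saved list, skip to the
# requested position, and emit one page plus next/previous links.
my( $list, $pos ) = $ENV{QUERY_STRING} =~ /^(\w+)(?:&pos=(\d+))?/;
$pos ||= 1;

open( LIST, "/tmp/hurl-$list" ) or die "no such list: $!";
my @messages = <LIST>;
close( LIST );

print "Content-type: text/html\n\n<PRE>\n";
foreach my $i ( $pos .. $pos + 99 ) {
    last if $i > @messages;
    print $messages[ $i - 1 ];               # the real script formats and links
}
print "</PRE>\n";

my $next = $pos + 100;
print qq{<A HREF="/bin/browse?$list&pos=$next">Next page</A>\n}
    if $next <= @messages;
printf( qq{<A HREF="/bin/browse?%s&pos=%d">Previous page</A>\n},
        $list, $pos - 100 ) if $pos > 100;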

Selecting a link from a message list typically results in an article page being displayed with a URL like this:

http://servername/bin/message-ID?foo@bar.com&browsing=jiagvyfcn

This URL references the message in the archive with a message ID of foo@bar.com and indicates that we currently are browsing the list of messages referenced by jiagvyfcn. The message-ID script retrieves the DBM entry for foo@bar.com, which contains the file name and all the necessary link information for that message. The jiagvyfcn information is used to put a link back to browse?jiagvyfcn&pos=xx on the article page and to determine which messages the next in list and previous in list icons in the navigation bar should link to.

Some Advanced Features

In this section, I discuss some of the more advanced features of the interface and the new ideas I'm considering for the future.

Article Filters

One of the links in the navigation bar on the article page produces a secondary page with a number of filters that can be applied to the current article. These filters currently include things such as performing rot13 decryption (a trivial encryption method used for news articles) or adding a link on each word to a dictionary, thesaurus, or jargon file elsewhere on the Web. The potential for this is unlimited; as more and more of these gateways open up on the Web, HURL will be able to take advantage of them simply by adding an extra filter definition.
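
The rot13 filter, at least, amounts to a single tr/// in Perl; a minimal sketch:

# rot13 (sketch): rotate each letter 13 places.  Applying the filter
# twice restores the original text.
sub rot13 {
    my $text = shift;
    $text =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $text;
}
print rot13( "Guvf vf n grfg.\n" );          # prints "This is a test."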

Another such filter that I've been experimenting with recently is one that places links on any words it recognizes as being Perl function calls to the description for that function in an on-line Perl manual. Imagine reading through an archive of comp.lang.perl and being able to find out more information about a function just by clicking on part of someone's code example!
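
A sketch of how such a filter might work; the manual URL and the (deliberately short) list of function names are hypothetical:

# Perl-function filter (sketch): link recognized function names to an
# on-line manual.  A real filter would use the full list of built-ins
# and avoid matching inside strings and comments.
my %is_func = map { $_ => 1 } qw( print printf split join open close );
while( my $line = <STDIN> ) {
    $line =~ s{\b(\w+)\b}
              { $is_func{$1}
                  ? qq{<A HREF="/perlman/$1.html">$1</A>}
                  : $1 }ge;
    print $line;
}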

URL-Based Queries

There is another way to enter queries against the archive besides using the HTML form. By entering or linking to a URL constructed with the appropriate syntax, an on-the-fly query is performed, with the same results as if the regular form fields had been filled out.

This is invaluable as a method to link to a query result. On each author page, for example, there is a link to a list of that author's articles, but instead of creating lists of articles for each author beforehand, HURL simply creates a link to an on-the-fly query for articles emanating from that e-mail address. This technique also was used on the home page for the comp.infosystems.www.announce archive to create predefined queries (as displayed in Fig. 33.4).

An address for a URL query looks something like this:

http://somewhere/bin/query?Subject=something.interesting

It turned out that this was extremely easy to implement; whenever the query script is called with the GET method rather than the usual POST method that is used with the HTML form, fields that look like "Subject=something" are massaged into the multiple-variable format used by the POST method, and the script resumes normal execution. There was no need to write an extra query script or to go to great lengths to specify and decode an extra format for specifying queries.
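
A minimal sketch of that massaging step (the %field hash standing in for the script's internal form variables is hypothetical):

# Accept queries from either the HTML form (POST) or a hand-built
# URL (GET).  GET fields arrive in QUERY_STRING as name=value pairs
# and are decoded into the same %field hash that the POST path fills.
my %field;
if ( $ENV{REQUEST_METHOD} eq "GET" ) {
    foreach my $pair ( split /&/, $ENV{QUERY_STRING} ) {
        my( $name, $value ) = split( /=/, $pair, 2 );
        $value =~ tr/+/ /;                           # undo form encoding
        $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge; # decode %XX escapes
        $field{$name} = $value;
    }
}
# ...from here the query script runs exactly as it does for POST.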

Browser-Dependent Customizations

Recently, I have started to add code in a few places to tweak the output of some of the CGI scripts slightly to compensate for the special needs of various browsing platforms.

A normal query result might end up being wider than 80 columns due to long Subject lines, for example, but this isn't a problem with a graphical browser because the horizontal scrollbar can be used and the extra-long lines aren't wrapped or truncated. With a text-based client such as Lynx, however, <PRE>-formatted text that is wider than 80 columns is broken across multiple lines and becomes difficult to use.

I therefore added code that checks the value of the HTTP_USER_AGENT variable, and if the script is being called with Lynx, the output of the CGI scripts automatically is changed to fit within an 80-column screen by making each of the fields in the Message List browser slightly narrower and truncating overly long subjects.
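
A minimal sketch of the check; the widths and sample values are illustrative:

# Narrow the message-list columns for text browsers such as Lynx,
# which wrap <PRE> lines that are wider than 80 columns.
my( $date, $author, $subject ) =
    ( "12 Mar 1995", 'someone@example.com', "A rather long subject line" );
my $width = $ENV{HTTP_USER_AGENT} =~ /Lynx/i ? 40 : 55;
printf( "%-12.12s %-22.22s %-.${width}s\n", $date, $author, $subject );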

In the future, I'll likely take this concept a bit further and start returning true HTML 3.2 tables for browsers that can support them and HTML 2.0 <PRE>-formatted tables for browsers that can't. In general, I don't agree with using CGI scripts specifically for this purpose, because a single HTML document normally can be viewed everywhere if it's properly constructed. In this case, though, the documents already are being served by CGI scripts, and it's trivial to add a few lines of code to customize the output slightly.

Future Plans

Although HURL is quite usable in its current form, I plan to continue its development in the future, adding new filters and browsing options and increasing the level of customizability.

Some of the specific features I plan to implement follow:

Article threading  Most modern Usenet newsreaders allow for discussion threads to be navigated in a hierarchical manner; it would be nice to support this in HURL as well.

Full-text searching  Currently, queries can be performed only against article header elements, but it often is useful to search for words within the articles themselves. A future version of HURL will provide this capability. (I already have had good results with this using the Glimpse file system indexer.)

Incremental indexing  The original talk.bizarre archive project didn't require the indexes to be up to date, so I envisioned the builds taking place on something like a weekly basis. Recent experience with archives of other newsgroups, however, has shown that it is desirable to be able to add articles to the build on a daily or even a more frequent basis.

Increased customizability  The interface will become increasingly flexible with regard to the various views of the information contained in a news archive. A user will be able to specify different header fields to be shown in the Message List browser, for example, instead of the default Date, From, and Subject fields.
