Venice Beach, California.

Chapter 10: Sites That Are Really Programs

by Philip Greenspun, part of Philip and Alex's Guide to Web Publishing

The classic (circa 1993) Web site comprises static .html files in a Unix file system. This kind of site is effective for one-way non-collaborative publishing of material that seldom changes.

You needn't turn your Web site into a program just because the body of material that you are publishing is changing. Sites like http://www.yahoo.com, for example, are sets of static files that are periodically generated by programs grinding through a dynamic database. With this sort of arrangement, the site inevitably lags behind the database but you can handle millions of hits a day without a major investment in computer hardware, custom software, or thought.

If you want to make a collaborative site, however, then at least some of your Web pages will have to be computer programs. Pages that process user submissions have to add user-supplied data to your Web server's disk. Pages that display user submissions have to look through a database on your server before delivering the relevant contributions.
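To make the idea concrete, here is a minimal sketch in AOLserver Tcl, the API behind most of the examples later in this chapter. The URLs, the file path, and the procedure names are all hypothetical, and a real site would use a relational database (plus HTML-quoting of user input) rather than a naive flat file:

# one URL accepts a reader's comment and adds it to the server's disk;
# a second URL reads the accumulated comments back and displays them
ns_register_proc POST /comments/post comments_post
ns_register_proc GET /comments comments_show

proc comments_post {} {
    # pull the "comment" field out of the submitted form data
    set comment [ns_set get [ns_conn form] comment]
    set stream [open "/web/data/comments.txt" a]
    puts $stream $comment
    close $stream
    # bounce the reader over to the page that shows all comments
    ns_returnredirect "/comments"
}

proc comments_show {} {
    set stream [open "/web/data/comments.txt" r]
    set everything [read $stream]
    close $stream
    ns_return 200 text/html "<html><body bgcolor=#ffffff>
<h3>Reader comments</h3>
<pre>$everything</pre>
</body></html>"
}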

Even if you want to publish completely static, non-collaborative material, at least one portion of your site will require server-side programming: the search engine. To provide a full-text search over your material, your server must be able to take a query string from the user, compare it to the files on the disk, and then return a page of links to relevant documents.

This chapter discusses the options available to Web publishers who need to write program-backed pages. Here are the steps:

  1. Decide whether you're building a dynamic document or a program with a Web interface.
  2. Choose a computer language.
  3. Choose a program invocation mechanism.
  4. Choose a Web server program to support the first three choices.

Step 1: Document or Program?

A document is typically edited or updated over a period of days. Changes in one portion of a document don't have far-reaching implications for other portions. A computer program is typically versioned, debugged, and tested over a period of months.

Every interesting Web site has some characteristics of both a document and a computer program. There is thus no correct answer to the question "Is your site a hypertext document with bits of computation or a computer program with bits of static text?" However, the tools that make it easy for a team of experts to develop a computer program will get in the way if your site is fundamentally a document. Conversely, the tools that make it convenient to edit a document can lead to sloppy and error-filled computer programs.

Server-side programming systems that take the document model to its logical extreme are AOLserver Dynamic Pages (ADP) and Microsoft Active Server Pages (ASP). A vanilla HTML file is a legal ADP or ASP document. If you want to add some computation, you weave in little computer language fragments, surrounded by <% ... %>. If you want to fix a typo or a programming bug, you edit the .adp or .asp file and hit reload in your Web browser to see the new version. Almost always, the connection is direct and immediate between the URL where the problem was observed and the file on the server that you must edit. You don't have to understand much of the document's structure to fix a bug.
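For example, here is a minimal ADP page. This one is a hypothetical sketch (Example 6 later in this chapter shows a real one), but note that everything outside the <% ... %> markers is ordinary HTML:

<html>
<head><title>Hello</title></head>
<body bgcolor=#ffffff>
<h3>Hello</h3>
This page is served by <%=[ns_info name]%>.
<%
# a full Tcl escape; ns_puts writes into the page being built
ns_puts "The time on the server is [ns_httptime [ns_time]]."
%>
</body>
</html>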

At the other end of the document/program spectrum are various "application servers" that require you to program in C or Java. HTML text is inevitably buried inside these programs. Fixing a typo requires editing the program, compiling the program, and reloading the compiled code into the Web or application server. If there is a problem with a URL, fixing it might require reading and editing dozens of program files and understanding most of the program's overall structure.

With the right tools and programmer resources, you can build a jewel-like software system to sit behind a Web site. But ask yourself whether the entire service isn't likely to be redesigned after six months, and if, realistically, your site isn't going to be thrown together hastily by overworked programmers. If so, perhaps it will be best to look for the tightest development cycle.

Step 2: Choose a Computer Language

People usually choose a computer language according to how well it supports management of complexity. Much of the complexity in a web site, however, is in the number of URLs and how they interact (i.e., how many form-variable arguments get passed from one page to another). So the system structure is very similar regardless of the computer language employed.

You would think that picking a Web site development language would be trivial. Obviously the best languages are safe and incorporate powerful object systems. So let's do everything in Common Lisp or Java. Common Lisp can run interpreted as well as compiled, which makes it a more efficient language for developers. So Common Lisp should be the obvious winner of the Web server language wars. Yet nobody uses Common Lisp for server-side scripting. Is that because Java-the-hype-king has crushed it? No. In fact, to a first approximation, nobody uses Java for server-side scripting. Almost everyone is using simple interpreted languages such as Perl, Tcl, or Visual Basic.

How could a lame scripting language like Tcl possibly compete with Lisp? At some level, the only data type available in Tcl is a string. Well, guess what? The only data type that you can write to a Netscape browser is a string. And all the information from the Oracle relational database management system on which you are relying comes back to you as strings. So maybe it doesn't matter whether your scripting language has an enfeebled type system.

Are these languages really the best? My computer science friends would shoot me for saying that Tcl is as good as Common Lisp and better than Java. But it turns out to be almost true. Tcl is better than Java because Tcl doesn't have to be compiled. Tcl can be better than Lisp because string manipulation is simpler. For example, in Tcl


"posted by $email on $posting_date."
will generate a string from the fragments of static ASCII above plus the contents of the variables $email and $posting_date. These were presumably recently pulled from a relational database. The result might look something like

"posted by philg@mit.edu on February 15, 1998."
In Common Lisp, you'd have

(concatenate 'string "posted by " email " on " posting-date ".")
which uses a fabulously general mechanism for concatenating sequences. concatenate can work on sequences of ASCII characters (strings) or sequences of TCP packets or sequences of three-dimensional arrays or sequences of double-precision complex numbers. Sequences can either be lists (fast to modify) or vectors (fast to retrieve). This kind of flexibility, which Java apes, is wonderful except that Web programmers are concatenating strings 99.99 percent of the time and Tcl's syntactic shortcuts make code easier to read and more reliable.

What's my prediction for the powerful language that will sit behind the Web sites of the future?

HTML.

HTML? But didn't we spend a whole chapter talking about how deficient it was even as a formatting language? How can HTML function as a server-side programming language?

Server-Parsed HTML

In the beginning, there was server-parsed HTML. You added an HTML comment to a file, for example
<!--#include FILE="/web/author-info.txt" -->
and then reloaded the file in a browser.

Nothing changed. Anything surrounded by "<!--" and "-->" is an HTML comment. The browser ignores it.

Your intent, though, was to have the Web server notice this command and replace the comment with the contents of the file /web/author-info.txt. To do that, you have to rename the file so that it has an .shtml extension. Now the server knows that you are actually programming in an extended version of HTML.

The AOLserver takes this one step further. To the list of standard SHTML commands, they've added #nstcl:

<!--#nstcl script="ns_httpget http://cirrus.sprl.umich.edu/wxnet/fcst/boston.txt" -->
which lets a basically static HTML page use the ns_httpget Tcl API function to go out on the Internet, from the server, and grab http://cirrus.sprl.umich.edu/wxnet/fcst/boston.txt before returning the page to the user. The contents of http://cirrus.sprl.umich.edu/wxnet/fcst/boston.txt are included in place of the comment tag.

This is a great system because a big Web publisher can have its programmers develop a library of custom Tcl functions that its content authors simply call from server-parsed HTML files. That makes it easy to enforce style conventions company-wide. For example,

<!--#nstcl script="webco_captioned_photo samoyed.jpg {This is a Samoyed}" -->
might turn into
<h3>
<img src="samoyed.jpg" 
     alt="This is a Samoyed">
This is a Samoyed
</h3>
until the day that the Webco art director decides that HTML tables would be a better way to present these images. So a programmer redefines the procedure webco_captioned_photo, and the next time they are served, thousands of image references instead turn into
<table>
<tr>
  <td><img src="samoyed.jpg" 
           alt="This is a Samoyed">
  <td>This is a Samoyed
</tr>
</table>

HTML As a Programming Language

As long as we're programming our server, why not define a new language, "Webco HTML"? Any file with a .whtml extension will be interpreted as a Webco HTML program and the result, presumably standard HTML, will be served to the requesting users. Webco HTML has the same syntax as standard HTML, just more tags. Here's the captioned photo example:
<CAPTIONED-PHOTO "samoyed.jpg" "This is a Samoyed">
Just like the Tcl function, this Webco HTML function takes two arguments, an image file name and a caption string. And just like the Tcl function, it produces HTML tags that will be recognized by standard browsers. I think it is cleaner than the "include a Tcl function call" .shtml example because the content producers don't have to switch back and forth between HTML syntax and Tcl syntax.
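Nothing magical is needed to support this. Here is a sketch of how a .whtml interpreter might be bolted onto AOLserver with ns_register_proc; the /whtml directory and the expansion logic are hypothetical, and only the one tag is handled:

# serve everything under /whtml through our tag expander
ns_register_proc GET /whtml webco_serve_whtml

proc webco_captioned_photo {image_file caption} {
    return "<h3><img src=\"$image_file\" alt=\"$caption\">\n$caption</h3>"
}

proc webco_serve_whtml {} {
    # read the .whtml file named by the request URL
    set stream [open "[string trimright [ns_info pageroot] /][ns_conn url]" r]
    set html [read $stream]
    close $stream
    # rewrite each <CAPTIONED-PHOTO "file" "caption"> into standard HTML
    while { [regexp {<CAPTIONED-PHOTO "([^"]*)" "([^"]*)">} $html \
                 whole_tag image_file caption] } {
        regsub {<CAPTIONED-PHOTO "[^"]*" "[^"]*">} $html \
               [webco_captioned_photo $image_file $caption] html
    }
    ns_return 200 text/html $html
}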

How far can we go with this? Pretty far. The best of the enriched HTMLs is Meta-HTML (http://www.metahtml.com). Meta-HTML is fundamentally a macro expansion language. We'd define our captioned-photo tag thusly:

<define-tag captioned-photo image-url text>
  <h3>
    <img src="<get-var image-url>" alt="<get-var text>"> <br>
    <get-var text>
  </h3>
</define-tag>

Now that we are using a real programming language, though, we'd probably not stop there. Suppose that Webco has decided that it wants to be on the leading edge as far as image formats go. So it publishes images in three formats: GIF, JPEG, and progressive JPEG. Webco is an old company, so every image is available as a GIF but only some are available as JPEG and even fewer as progressive JPEG. Here's what we'd really like captioned-photo to do:

  1. Change the function to take just the file name as an argument, with no extension; for example, "foobar" instead of "foobar.jpg".
  2. Look at the client's user-agent header.
  3. If the user-agent is Mozilla 1, then look in the file system for foobar.jpg and reference it if it exists (otherwise reference foobar.gif).
  4. If the user-agent is Mozilla 2, then look in the file system for foobar-prog.jpg (progressive JPEG) and reference it; otherwise look for foobar.jpg; otherwise reference foobar.gif.

This is straightforward in Meta-HTML:

<define-function captioned-photo stem caption>
  ;;; If the user-agent is Netscape, try using a JPEG format file
  <when <match <get-var env::http_user_agent> "Mozilla">>
    ;;; this is Netscape
    <when <match <get-var env::http_user_agent> "Mozilla/[2345]">>
      ;;; this is Netscape version 2, 3, 4, or 5(!)
      <if <get-file-properties
         <get-var mhtml::document-root>/<get-var stem>-prog.jpg>
          ;;; we found the progressive JPEG in the Unix file system
         <set-var file-to-reference = <get-var stem>-prog.jpg>>
    </when>
    ;;; If we haven't defined FILE-TO-REFERENCE yet, 
    ;;; try the simpler JPEG format next.
    <when <not <get-var file-to-reference>>>
      <if <get-file-properties
            <get-var mhtml::document-root>/<get-var stem>.jpg>
          <set-var file-to-reference = <get-var stem>.jpg>>
    </when>
  </when>
  ;;; If FILE-TO-REFERENCE wasn't defined above, default to GIF file
  <when <not <get-var file-to-reference>>>
    <set-var file-to-reference = <get-var stem>.gif>
  </when>
  ;;; here's the result of this function call, four lines of HTML
  <h3>
  <img src="<get-var file-to-reference>" alt="<get-var caption>"> 
  <br>
  <get-var caption>
  </h3>
</define-function>
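A content author would then write something like

<captioned-photo "samoyed" "This is a Samoyed">

and the server would pick the best image format for each visitor's browser at page-serving time.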

This example only scratches the surface of Meta-HTML's capabilities. The language includes many of the powerful constructs, such as session variables, that you find in Netscape's LiveWire system. However, for my taste, Meta-HTML is much cleaner and better implemented than the LiveWire stuff. Universal Access offers a "pro" version of Meta-HTML compiled with the OpenLink ODBC libraries so that it can talk efficiently to any relational database (even from Linux!).

Is the whole world going to adopt this wonderful language? Meta-HTML does seem to have a lot going for it. The language and first implementation were developed by Brian Fox and Henry Minsky, two hard-core MIT computer science grads. Universal Access is giving away their source code (under a standard GNU-type license) for both a stand-alone Meta-HTML Web server and a CGI interpreter that you can use with any Web server. They distribute precompiled binaries for popular computers. They offer support contracts for $500 a year. If you don't like Universal Access support, you can hire the C programmer of your choice to maintain and extend their software. Minsky and Fox have put the language into the public domain. If you don't like any of the Universal Access stuff, you can write your own interpreter for Meta-HTML, using their source code as a model.


Trouble in Paradise

Atlantic City, New Jersey.

The biggest problem with extended HTML is that it is not HTML. The suite of tools that you're using to work with HTML may not work with whatever HTML extensions you've adopted. For example, suppose that you are using the CAPTIONED-PHOTO tag throughout your content. You hire a writer to update some of your pages. He downloads them in Netscape Navigator, at which time your server converts them into standard HTML TABLE, TR, and TD tags. He edits the document in Netscape Composer and uses HTTP PUT or FTP to place it back on the server. At this point, all the CAPTIONED-PHOTO tags have been lost and with them your insurance against changes in the HTML standard.

So if someone offers you even a minor variation on HTML, ask him what tools he's developed and how the new language will fit into all of your production processes.

Step 3: Choose a Program Invocation Mechanism

Otters. Audubon Zoo. New Orleans, Louisiana.

What happens after the user requests a dynamic page? How does the server know that a program needs to be called and what does it have to do to run that program?

The oldest and most common mechanism for program invocation via the Web is the Common Gateway Interface (CGI). The CGI standard is an abstraction barrier that dictates what a program should expect from the Web server, for example, user form input, and how the program must return characters to the Web server program for them to eventually be written back to the Web user. If you write a program with the CGI standard in mind, it will work with any Web server program. You can move your site from NCSA HTTPD 1.3 to Netscape Communications 1.1 to AOLserver 2.1 and all of your CGI scripts will still work. You can give your programs away to other webmasters who aren't running the same server program. Of course, if you wrote your CGI program in C and compiled it for an HP Unix box, it isn't going to run so great on a Windows NT machine.

Oops.

We've just discovered why most CGI scripts are written in Perl, Tcl, or some other interpreted computer language. The systems administrator can install the Perl or Tcl interpreter once and then Web site developers on that machine can easily run any script that they download from another site.

Fixing a bug in an interpreted CGI script is easy. A message shows up in the error log when a user accesses "http://yourserver.nerdu.edu/bboard/subject-lines.pl". If your Web server document root is at /web (my personal favorite location), then you know to edit the file /web/bboard/subject-lines.pl. After you've found the bug and written the file back to the disk, the next time the page is accessed the new version of the subject-lines Perl script will be interpreted.

For concreteness, here's an example Unix CGI program:
#!/usr/contrib/bin/perl
# the first line in a Unix shell script says where to find the
# interpreter. If you don't know where perl lives on your system, type
# "which perl", "type perl", or "whereis perl" at any shell
# and put the result after the #!
print "Content-type: text/html\n\n";
# now we have printed a header (plus two newlines) indicating that the
# document will be HTML; whatever else we write to standard output will
# show up on the user's screen
print "<h3>Hello World</h3>";

This example program will print "Hello World" as a level-3 headline. If you want to get more sophisticated, read some on-line tutorials, The CGI/Perl Cookbook (Patchett & Wright; Wiley, 1997), or CGI Programming on the World Wide Web (Gundavaram; O'Reilly, 1996).

It is that easy to write Perl CGI scripts and get server independence, a tight software development cycle, and ease of distribution to other sites. With that in mind, you might ask how many of my thousands of dynamic Web pages use this program invocation mechanism. The answer? One. It was written by Architext and it looks up user query strings in the site's local full-text index. Why don't I have more?

Enter the server application programming interface (API). As I discussed in the "So You Want to Run Your Own Server" chapter, most Web server programs allow you to supplement their behavior with extra software that you write. This software runs inside the Web server's process, saving the overhead of forking CGI scripts. And because the Web server program runs continuously, for days or weeks at a time, it is the natural candidate to be the RDBMS client.

All Web server APIs allow you to specify "If the user makes a request for a URL that starts with /foo/bar/ then run Program X". The really good Web server APIs allow you to request program invocation before or after pages are delivered. For example, you ought to be able to say "When the user makes a request for any HTML file, run Program Y first and don't serve the file if Program Y says it is unhappy". Or "After the user has been served any file from the /car-reviews directory, run Program Z" (presumably Program Z performs some kind of logging).
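With the AOLserver Tcl API, for instance, those registrations might look like the following (program_y and program_z stand for hypothetical Tcl procedures; Example 2 below registers a real filter):

# run program_y before any .html file is served; if program_y is
# unhappy, it can return "filter_return" to abort the request
ns_register_filter preauth GET /*.html program_y

# run program_z after anything under /car-reviews has been delivered
ns_register_filter trace GET /car-reviews/* program_z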

Step 4: Choose a Web Server Program to Support the First Three Choices

Remember the steps:
  1. Decide whether you're building a dynamic document or a program with a Web interface.
  2. Choose a computer language.
  3. Choose a program invocation mechanism.
  4. Choose a Web server program to support the first three choices.
You've made the first three choices. Now you have to look around for Web server software that will support them. If you've settled on CGI as a program invocation mechanism, then it won't really matter which server you use (server independence being the main point of CGI, after all). If you want to use a server's API, then you need to find a server program that supports the language and development style that you've chosen in steps 1 and 2.

Example 1: Redirect

100th Anniversary Boston Marathon (1996).

When my friend Brian and I were young and stupid, we installed the NCSA 1.3 Web server program on our research group's file server, martigny.ai.mit.edu. We didn't bother to make an alias for the machine like "www.brian-and-philip.org", so the URLs we distributed looked like "http://martigny.ai.mit.edu/samantha/".

Sometime in mid-1994 the researchers depending on Martigny, whose load average had soared from 0.2 to 3.5, decided that a 100,000-hit-per-day Web site was something that might very nicely be hosted elsewhere. It was easy enough to find a neglected HP Unix box, which we called swissnet.ai.mit.edu. And we sort of learned our lesson and did not distribute this new name in URLs but rather aliases: "www-swiss.ai.mit.edu" for research publications of our group (known as "Switzerland" for obscure reasons); "webtravel.org" for my travel stuff; "photo.net" for my photo stuff; "pgp.ai.mit.edu" for Brian's public key server; "samantha.rules-the.net" for fun.

But what were we to do with all the hard-wired links out there to martigny.ai.mit.edu? We left NCSA 1.3 loaded on Martigny but changed the configuration files so that a request for "http://martigny.ai.mit.edu/foo/bar.html" would result in a 302 redirect being returned to the user's browser so that it would instead fetch http://www-swiss.ai.mit.edu/foo/bar.html.

Two years later, in August 1996, someone upgraded Martigny from HP-UX 9 to HP-UX 10. Nobody bothered to install a Web server on the machine. People began to tell me "I searched for you on the Web but your server has been down since last Thursday." Eventually I figured out that the search engines were still sending people to Martigny, a machine that was in no danger of ever responding to a Web request since it no longer ran any program listening to port 80.

The one store on Chappaquiddick, Martha's Vineyard, Massachusetts: a combination beer/ice shop and junkyard.

Rather than try to dig up a copy of NCSA 1.3, I decided it was time to get some experience with Apache, the world's most popular Web server. I couldn't get the 1.2 beta sources to compile. So I said, "This free software stuff is for the birds; I need the heavy-duty iron." I installed the 80MB Netscape Enterprise Server and sat down with the frames- and JavaScript-heavy administration server. After 15 minutes, I'd configured the port 80 server to redirect. There was only one problem: It didn't work.

I spent a day going back and forth with Netscape tech support. "Yes, the Enterprise server definitely could do this. Probably it wasn't configured properly. Could you e-mail us the obj.conf file? Hmmm . . . it appears that your obj.conf file is correctly specifying the redirect. There seems to be a bug in the server program. You can work around this by defining custom error message .html files with Refresh: tags so that users will get popped over to the new server if they are running a Netscape browser."

I pointed out that this would redirect everyone to the swissnet server root, whereas I wanted "/foo/bar.html" on Martigny to redirect to "/foo/bar.html" on Swissnet.

"Oh."

They never got back to me.

Porsche and corn. Pennsylvania.

I finally installed AOLserver, which doesn't have a neat redirect facility, but I figured that the Tcl API was flexible enough that I could make the server do what I wanted.

First, I had to tell AOLserver to feed all requests to my Tcl procedure instead of looking around in the file system:

ns_register_proc GET / martigny_redirect

This is a Tcl function call. The function being called is named ns_register_proc. Any function that begins with "ns_" is part of the NaviServer Tcl API (NaviServer was the name of the program before AOL bought NaviSoft in 1995). ns_register_proc takes three arguments: method, URL, and procname. In this case, I'm saying that HTTP GETs for the URL "/" (and below) are to be handled by the Tcl procedure martigny_redirect:


proc martigny_redirect {} {
    append url_on_swissnet "http://www-swiss.ai.mit.edu" [ns_conn url]
    ns_returnredirect $url_on_swissnet
}

This is a Tcl procedure definition, which has the form "proc procedure-name arguments body". martigny_redirect takes no arguments. When martigny_redirect is invoked, it first computes the full URL of the corresponding file on Swissnet. The meat of this computation is a call to the API procedure "ns_conn" asking for the URL that was part of the request line.

With the full URL computed, martigny_redirect's second body line calls the API procedure ns_returnredirect. This writes back to the connection a set of 302 redirect headers instructing the browser to rerequest the file, this time from "http://www-swiss.ai.mit.edu".


Example 2: Customizing Access

Some friends of mine at MIT Press wanted to sell subscriptions to electronic journals either to institutions or to individuals. They also wanted portions of the journals to be freely available. In the case of an institutional subscriber, the server needed to recognize that the client came from a range of authorized IP addresses, e.g., any computer whose IP address starts with "36." is at Stanford, so if they've paid for a site-wide subscription, don't demand a username or password from individuals. For individuals, they decided to start by simply distributing the same username/password pair to all the subscribers. All of the information about who was authorized had to come from their relational database. It turned out that this set of constraints was too complex for the standard permissions module that comes with AOLserver, if only because it uses its own little Unix file-based database. Fortunately, the AOLserver API provides for program invocation prior to page service via a mechanism called filters. After 20 minutes we came up with the following program:
# tell AOLserver to watch for PDF file requests under the /ejournal directory
# if we don't add additional ns_register_filter commands, all the 
# other files will be available to everyone
ns_register_filter preauth GET /ejournal/*.pdf ejournal_check_auth

proc ejournal_check_auth {args why} {
    # all the parameters we might want to change
    set user "open"
    set passwd "sesame"
    # on the real-life server, these are pulled from a relational database
    # but here for an example, let's just set it to MIT and Stanford
    set allowed_ip_ranges [list "18.*" "36.*"]

    foreach pattern $allowed_ip_ranges {
	if { [string match $pattern [ns_conn peeraddr]] } {
	    # a paying customer; the file will be sent
	    return "filter_ok"
	}
    }

    # not coming from a special IP address, let's check the 
    # username and password headers that came with the request
    if { [ns_conn authuser] == $user && [ns_conn authpassword] == $passwd } {
	# they are an authorized user; the file will be sent
	return "filter_ok"
    }

    # not a good IP address, no headers, hammer them with a 401 demand
    ns_set put [ns_conn outputheaders] WWW-Authenticate "Basic realm=\"MIT Press:Restricted\""
    ns_returnfile 401 text/html "[ns_info pageroot]ejournal/please-subscribe.html"

    # stop AOLserver from handling the request by returning a special code
    return "filter_return"
}

Example 3: Aid to Evaluating Your Accomplishments (randomizing a page)

Car Wash. Monterey, California.
"For me grad school is fun just like playing Tetris all night is fun. In the morning you realize that it was sort of enjoyable, but it didn't get you anywhere and it left you very very tired."
-- Michael Booth's comment on my "Women in Computing" page
Computer science graduate students earn a monthly stipend that wouldn't hire a good Web/db programmer for an afternoon. If you've been reading Albert Camus lately ("It is a kind of spiritual snobbery to think one can be happy without money") then you'd expect this to lead to occasional depression. For these depressed souls, I published Career Guide for Engineers and Scientists (http://photo.net/philg/careers.html).

I thought that starving graduate students forgoing six years of income would be cheered to read the National Science Foundation report that "Median real earnings remained essentially flat for all major non-academic science and engineering occupations from 1979-1989. This trend was not mirrored among the overall work force where median income for all employed persons with a bachelor's degree or higher rose 27.5 percent from 1979-1989 (to a median salary of $28,000)."

I even did custom photography for the page.

But I didn't think I'd really be able to get under the skin of America's best and brightest young computer scientists until Eve Andersson (the brilliant Caltech Pi Goddess) and I released Aid to Evaluating Your Accomplishments.

Here's the source code:

# a helper procedure to pick N items randomly from a list
# note that it uses tail-recursion, importing a little bit 
# of the clean Scheme philosophy into the ugly world of Tcl

proc choose_n_random {choices_list n_to_choose chosen_list} {
    if { $n_to_choose == 0 } {
	return $chosen_list
    } else {
	set chosen_index [randomRange [llength $choices_list]]
	set new_chosen_list [lappend chosen_list [lindex $choices_list $chosen_index]]
	set new_n_to_choose [expr $n_to_choose - 1]
	set new_choices_list [lreplace $choices_list $chosen_index $chosen_index]
	return [choose_n_random $new_choices_list $new_n_to_choose $new_chosen_list]
    }
} 

# we encapsulate the printing of an individual person so that 
# one day we can easily change the design of the page (we display
# four people at once and putting this in a procedure keeps us from
# having to edit the same code four times).

proc one_person {person} {
    set name [lindex $person 0]
    set title [lindex $person 1]
    set achievement [lindex $person 2]
    return "<h4>$title $name</h4>\n $achievement <br><br> <center> (<a href=\"http://altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=&q=[ns_urlencode $name]\">more</a>) </center>\n"
}

# we return HTTP headers to the client

ReturnHeaders

# we return as much of the page as we can before figuring out which four
# people we're going to display; this way if we were going to query a 
# relational database (potentially taking 1/2 second), the user would
# have something on-screen to read

ns_write "<html>
<head>
<title>Aid to Evaluating Your Accomplishments</title>
</head>

<body bgcolor=#ffffff text=#000000>
<h2>Aid to Evaluating Your Accomplishments</h2>

part of <a href=\"/philg/careers.html\">Career Guide for Engineers and Scientists</a>


<hr>

Compare yourself to these four ordinary people who were selected at random:

<br>
<br>
"

# each person is name, title, accomplishment(s)

set einstein [list "A. Einstein" "Patent Office Clerk" \
                   "Formulated Theory of Relativity."]

set mill [list "John Stuart Mill" "English Youth" \
               "Was able to read Greek and Latin at age 3."]

set mozart [list "W. A. Mozart" "Viennese Pauper" \
                 "Composed his first opera, <i>La finta
                 semplice</i>, at the age of 12."]

set jesus [list "Jesus of Nazareth" "Judean Carpenter" \
                "Told young women he was God and they believed him."]

set stevens [list "Wallace Stevens" "Hartford Connecticut Insurance Executive" \
                  "Won Pulitzer Prize for Poetry in 1954; best known for
                   \"Thirteen Ways of Looking at a Blackbird\"."]

# ... there are a bunch more in the real live script

set average_folks [list $einstein $mill $mozart $jesus]

# we call our choose_n_random procedure, note that we give it an empty
# list to kick off the tail-recursion

set four_average_folks [choose_n_random $average_folks 4 [list]]

ns_write "<table cellpadding=20>
<tr>
<td valign=top>
[one_person [lindex $four_average_folks 0]]
</td>
<td valign=top>
[one_person [lindex $four_average_folks 1]]
</td>
</tr>
<tr>
<td valign=top>
[one_person [lindex $four_average_folks 2]]
</td>
<td valign=top>
[one_person [lindex $four_average_folks 3]]
</td>
</tr>
</table>
"

# note how in the big block of static HTML below, we're forced to 
# put backslashes in front of the string quotes.  This is annoying 
# and we wouldn't have to do it if we'd implemented this using
# AOLserver Dynamic Pages (where the text is HTML by default, 
# Tcl code by exception).

ns_write "

<p>

Programmed by <a href=\"http://www.ugcs.caltech.edu/~eveander/\">Eve
Astrid Andersson</a> and <a href=\"/philg/\">Philip Greenspun</a> in
<a href=\"/wtr/servers.html#naviserver\">AOLserver Tcl</a>.  If you're
a nerd, you might find <a href=\"four-random-people.txt\">the source
code</a> useful.

<P>

Original Inspiration: <cite>How to Make Yourself Miserable</cite>, by
Dan Greenburg

<hr>
<a href=\"/philg/\"><address>philg@mit.edu</address></a>
</body>
</html>
"


Example 4: Focal Length Calculator (taking data from users)

Alex in front of the Green Building, Massachusetts Institute of Technology. 100th Anniversary Boston Marathon (1996).

Back in the 1960s, an IBM engineer had a good idea: build a smart terminal that could download a form from a mainframe computer. The form would have reserved fields for display only, input fields where the user could type, and blinking fields. After the user had filled out all the input fields, the data would be submitted to the mainframe and acknowledged or rejected if there were any mistakes. This method of interaction was rather frustrating and disorienting for users but made efficient use of the mainframe's precious CPU time. This was the "3270" terminal, and hundreds of thousands were sold 20 years ago, mostly to big insurance companies and the like.

The forms user interface model fell into the shade after 1984, when the Macintosh's "user drives" pull-down menu system was introduced. However, HTML forms as classically conceived work exactly like the good old 3270. Here's an example that is firmly in the 3270 mold, taken from the Lens chapter of my photography tutorial textbook (http://photo.net/photo/tutorial/lens.html). The basic idea is to help people figure out what size lens they will need to buy or rent in order to make a particular image. They fill in a form with the distance to their subject and the height of their subject. The server then tells them what focal length lens they need for a 35mm camera.

Here's the HTML source for the form:


<form method=post action=focal-length.tcl>
How far away is your subject?  
<input type=text name=distance_in_feet size=7>  (in feet)
<p>
How high is the object you want to fill the frame?  
<input type=text name=subject_size_in_feet size=7>  (in feet)

<p>

<input type=submit>

</form>

Here's the AOLserver Tcl program that processes the user input:


set_form_variables

# distance_in_feet, subject_size_in_feet are the args from the form
# they are now set in Tcl local variables thanks to the magic 
# utility function call above
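# (set_form_variables is a utility procedure of the author's, not part
# of the AOLserver API.  a minimal version might loop over the form
# data, which AOLserver exposes as an ns_set:
#    set form [ns_conn form]
#    for {set i 0} {$i < [ns_set size $form]} {incr i} {
#        uplevel [list set [ns_set key $form $i] [ns_set value $form $i]]
#    }
# the uplevel is what plants the variables in the calling page's scope)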

# let's do a little IBM mainframe-style error-checking here

if { ![info exists distance_in_feet] || [string compare $distance_in_feet ""] == 0 } {
    ns_return 200 text/plain "Please fill in the \"distance to subject\" field"
    # stop the execution of this script
    return
}

if { ![info exists subject_size_in_feet] || [string compare $subject_size_in_feet ""] == 0 } {
    ns_return 200 text/plain "Please fill in the \"subject size\" field"
    # stop the execution of this script
    return
}

# we presume that subject is to fill a 1.5 inch long-dimension of a
# 35mm negative

# ahhh... the joys of arithmetic in Tcl, a quality language so 
# much cleaner than Lisp

set distance_in_inches [expr $distance_in_feet * 12]
set subject_size_in_inches [expr $subject_size_in_feet * 12]

set magnification [expr 1.5 / $subject_size_in_inches]

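# thin-lens optics: 1/f = 1/d_subject + 1/d_image and magnification
# m = d_image/d_subject together imply f = d_subject / ((1/m) + 1),
# which is what the next line computes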
set lens_focal_length_inches [expr $distance_in_inches / ((1/$magnification) + 1)]

set lens_focal_length_mm [expr round($lens_focal_length_inches * 25.4)]

# now we return a page to the user, one big string into which we let Tcl
# interpolate some variable values

ns_return 200 text/html "<html>
<head>
<title>You need $lens_focal_length_mm mm </title>
</head>

<body bgcolor=#ffffff text=#000000>
<table>
<tr>
<td>
<a href=\"/photo/pcd0952/boston-marathon-46.tcl\"><img HEIGHT=198 WIDTH=132 src=\"/photo/pcd0952/boston-marathon-46.1.jpg\" ALT=\"100th Anniversary Boston Marathon (1996).\"></a>
<td>


<h2>$lens_focal_length_mm millimeters</h2>

will do the job on a Nikon or Canon or similar 35mm camera

<P>

(according to the <a href=\"/photo/tutorial/lens.html\">photo.net lens tutorial</a> calculator)

</tr>
</table>

<hr>

Here are the raw numbers:

<ul>
<li>distance to your subject:  $distance_in_feet feet ($distance_in_inches inches)
<li>long dimension of your subject:  $subject_size_in_feet feet ($subject_size_in_inches inches)
<li>magnification:  $magnification
<li>lens size required:  $lens_focal_length_inches inches ($lens_focal_length_mm mm)

</ul>

Assumptions: You are using a standard 35mm frame (24x36mm) whose long
dimension is about 1.5 inches.  You are holding the camera in portrait
mode so that your subject is filling the long side of the frame.  You
are supposed to measure subject distance from the optical midpoint of
the lens, which for a normal lens is roughly at the physical midpoint.

<P>

Source of formula:  <a href=\"/photo/dead-trees/professional-photoguide.html\">Kodak 
Professional Photoguide</a>
<br>
Source of server-side programming knowledge:  Chapter 9 of 
<a href=\"http://photo.net/wtr/dead-trees/\">How to be a Web Whore Just Like Me</a>
<br>
Time required to write this program:  15 minutes. 
<br>
Proof that philg is a nerd:  <a href=\"focal-length.txt\">view the source code</a>
<br>

What this is not: a slow Java program that will crash everyone's browser (except those behind corporate firewalls that block all Java applets)

<br>

Another thing this is not:  a CGI program that will make my poor old Unix box fork

<br>

Yet another thing this is not: a JavaScript program that you'd think
would be the right thing but then on the other hand it wouldn't work with some browsers and the last thing that I need is email from confused users


<h3>Bored?  Try again</h3>

<form method=post action=focal-length.tcl>
How far away is your subject?  
<input type=text name=distance_in_feet size=7 value=\"$distance_in_feet\">  (in feet)
<p>
How high is the object you want to fill the frame?  
<input type=text name=subject_size_in_feet size=7 value=\"$subject_size_in_feet\">  (in feet)

<p>

<input type=submit>

</form>

<h3>European?  Macro-oriented?</h3>

<form method=post action=focal-length-mm.tcl>
How far away is your subject?  
<input type=text name=distance_in_mm size=7>  (in millimeters)
<p>
How high is the object you want to fill the frame?  
<input type=text name=subject_size_in_mm size=7>  (in millimeters)

<p>

<input type=submit>

</form>


<hr>
<a href=\"/philg/\"><address>philg@mit.edu</address></a>
</body>
</html>"


Example 5: Bill Gates Personal Wealth Clock (taking data from foreign servers)

Academic computer scientists are the smartest people in the world. There are an average of 800 applications for every job. And every one of those applicants has a PhD. Anyone who has triumphed over 799 PhDs in a meritocratic selection process can be pretty sure that he or she is a genius. Publishing is the most important thing in academia. Distributing one's brilliant ideas to the adoring masses. The top computer science universities have all been connected by the Internet or ARPAnet since 1970. A researcher at MIT in 1975 could send a technical paper to all of his or her interested colleagues in a matter of minutes. With this kind of heritage, it is natural that the preferred publishing medium of 1990s computer science academics is . . . dead trees.

Yes, dead trees.

If you aren't in a refereed journal or conference, you aren't going to get tenure. You can't expect to achieve quality without peer review. And peer review isn't just a positive feedback mechanism to enshrine mediocrity. It keeps uninteresting papers from distracting serious thinkers at important conferences. For example, there was this guy in a physics lab in Switzerland, Tim Berners-Lee. And he wrote a paper about distributing hypertext documents over the Internet. Something he called "the Web". Fortunately for the integrity of academia, this paper was rejected from conferences where people were discussing truly serious hypertext systems.

Anyway, with foresight like this, it is only natural that academics like to throw stones at successful unworthies in the commercial arena. IBM and their mainframe customers provided fat targets for many years. True, IBM research labs had made many fundamental advances in computer science, but it seemed to take at least 10 years for these advances to filter into products. What kinds of losers would sell and buy software technology that was a decade behind the state of the art?

Then Bill Gates came along with technology that was 30 years behind the state of the art. And even more people were buying it. IBM was a faceless impediment to progress but Bill Gates gave bloated monopoly a name, a face, and a smell. And he didn't have a research lab cranking out innovations. And every non-geek friend who opened a newspaper would ask, "If you are such a computer genius, why aren't you rich like this Gates fellow?"

Naturally I maintained a substantial "Why Bill Gates is Richer than You" section on my site, but it didn't come into its own until the day my friend Brian showed me that the U.S. Census Bureau had put up a real-time population clock at http://www.census.gov/cgi-bin/popclock. There had been stock quote servers on the Web almost since Day 1. How hard could it be to write a program that would reach out into the Web and grab the Microsoft stock price and the population, then do the math to come up with what you see at http://www.webho.com/WealthClock?

This program was easy to write because the AOLserver Tcl API contains the ns_httpget procedure. Having my server grab a page from the Census Bureau is as easy as

ns_httpget "http://www.census.gov/cgi-bin/popclock"

Tcl the language made life easy because of its built-in regular expression matcher. The Census Bureau and the Security APL stock quote folks did not intend for their pages to be machine-parsable. Yet I don't need a long program to pull the numbers that I want out of a page designed for reading by humans.

Tcl the language made life hard because of its deficient arithmetic. Some computer languages, Pascal for example, are strongly typed. You have to decide when you write the program whether a variable will be a floating-point number, a complex number, or a string. Lisp is dynamically typed. You can write a mathematical algorithm with hundreds of variables and never specify their types. If the input is a bunch of integers, the output will be integers and rational numbers (ratios of integers). If the input is a complex double-precision floating-point number, then the output will be complex double precision. The type is determined at run-time. I like to call Tcl "whimsically" typed. The type of a variable is never really determined. It can be a number or a string. It depends on the context. If you are looking for a pattern, "29" is a string. If you are adding it to another number, "29" is a decimal number. But "029" is an octal number, so trying to add it to another number results in an error.
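A quick transcript from an interactive tclsh makes the point; the parenthesized lines stand in for output, and the exact error text varies with the Tcl version:

% expr "29" + 1
30
% expr "031" + 1
26
("031" was silently read as octal, i.e., as decimal 25)
% expr "029" + 1
(error: "9" is not a legal octal digit)
% expr [string trimleft "029" 0] + 1
30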

Anyway, here is the code. Look at the comments.

# this program copyright 1996, 1997 Philip Greenspun (philg@mit.edu)
# redistribution and reuse permitted under
# the standard GNU license
# this function turns "99 1/8" into "99.125"
proc wealth_RawQuoteToDecimal {raw_quote} {
    if { [regexp {(.*) (.*)} $raw_quote match whole fraction] } {
        # there was a space
        if { [regexp {(.*)/(.*)} $fraction match num denom] } {
            # there was a "/"
            set extra [expr double($num) / $denom]
            return [expr $whole + $extra]
        }
        # we couldn't parse the fraction
        return $whole
    } else {
        # we couldn't find a space, assume integer
        return $raw_quote
    }
}
###
#   done defining helpers, here's the meat of the page
###
# grab the stock quote and stuff it into QUOTE_HTML
set quote_html [ns_httpget "http://qs.secapl.com/cgi-bin/qs?ticks=MSFT"]

# regexp into the returned page to get the raw_quote out
regexp {Last Traded at</a></td><td align=right><strong>([^A-z]*)</strong>} \
       $quote_html match raw_quote

# convert whole number + fraction, e.g., "99 1/8" into decimal,
# e.g., "99.125"
set msft_stock_price [wealth_RawQuoteToDecimal $raw_quote]
set population_html [ns_httpget "http://www.census.gov/cgi-bin/popclock"]

# we have to find the population in the HTML and then split it up
# by taking out the commas
regexp {<H1>[^0-9]*([0-9]+),([0-9]+),([0-9]+).*</H1>} \
       $population_html match millions thousands units

# we have to trim the leading zeros because Tcl has such a
# brain damaged model of numbers and thinks "039" is octal
# this is when you kick yourself for not using Common Lisp
set trimmed_millions [string trimleft $millions 0]
set trimmed_thousands [string trimleft $thousands 0]
set trimmed_units [string trimleft $units 0]

# then we add them back together for computation
set population [expr ($trimmed_millions * 1000000) + \
                     ($trimmed_thousands * 1000) + \
                     $trimmed_units]

# and reassemble them in a string for display
set pretty_population "$millions,$thousands,$units"

# Tcl is NOT Lisp and therefore if the stock price and shares are
# both integers, you get silent overflow (because the result is too
# large to represent in a 32 bit integer) and Bill Gates comes out as a
# pauper (< $1 billion). We hammer the problem by converting to double
# precision floating point right here.
#
# (Were we using Common Lisp, the result of multiplying two big 32-bit
# integers would be a "big num", an integer represented with multiple
# words of memory; Common Lisp programs perform arithmetic correctly.
# The time taken to compute a result may change when you move from a
# 32-bit to a 64-bit computer but the result itself won't change.)
set gates_shares_pre_split [expr double(141159990)]
set gates_shares [expr $gates_shares_pre_split * 2]
set gates_wealth [expr $gates_shares * $msft_stock_price]
set gates_wealth_billions \
    [string trim [format "%10.6f" [expr $gates_wealth / 1.0e9]]]
set personal_share [expr $gates_wealth / $population]
set pretty_date [exec /usr/local/bin/date]

# we're done figuring, now let's return a page to the user
ns_return 200 text/html "<html>
<head>
<title>Bill Gates Personal Wealth Clock</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h2>Bill Gates Personal Wealth Clock</h2>
just a small portion of 
<a href=\"http://www-swiss.ai.mit.edu/philg/humor/bill-gates.html\">Why Bill Gates is Richer than You
</a>
by
<a href=\"http://www-swiss.ai.mit.edu/philg/\">Philip Greenspun</a>
<hr>
<center>
<br>
<br>
<table>
<tr><th colspan=2 align=center>$pretty_date</th></tr>
<tr><td>Microsoft Stock Price:
    <td align=right> \$$msft_stock_price
<tr><td>Bill Gates's Wealth:
    <td align=right> \$$gates_wealth_billions billion
<tr><td>U.S. Population:
    <td align=right> $pretty_population
<tr><td><font size=+1><b>Your Personal Contribution:</b></font>
    <td align=right>  <font size=+1><b>\$$personal_share</b></font>
</table>
<p>
<blockquote>
\"If you want to know what God thinks about money, just look at the
 people He gives it to.\" <br> -- Old Irish Saying
</blockquote>
</center>
<hr>
<a href=\"http://photo.net/philg/\"><address>philg@mit.edu</address>
</a>
</body>
</html>
"

So is this the real code that sits behind http://www.webho.com/WealthClock?

Actually, no. You'll find the real source code linked from the above URL.

Why the differences? I was concerned that, if it became popular, the Wealth Clock might impose an unreasonable load on the subsidiary sites. It seemed like bad netiquette for me to write a program that would hammer the Census Bureau and Security APL several times a second for the same data. It also seemed to me that users shouldn't have to wait for the two subsidiary pages to be fetched if they didn't need up-to-the-minute data.

So I wrote a general purpose caching facility that can cache the results of any Tcl function call as a Tcl global variable. This means that the result is stored in the AOLserver's virtual memory space and can be accessed much faster even than a static file. Users who want a real-time answer can demand one with an extra mouse click. The calculation performed for them then updates the cache for casual users.
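The facility amounted to memoization with a timeout. Here is a minimal sketch; the names are hypothetical and the production version also worried about concurrent updates:

# return a cached result of evaluating tcl_call if it is younger than
# max_age_seconds; otherwise evaluate for real and refresh the cache
proc memoized_eval {tcl_call max_age_seconds} {
    global memo_cache memo_cache_time
    set now [ns_time]
    if { [info exists memo_cache($tcl_call)]
         && ($now - $memo_cache_time($tcl_call)) < $max_age_seconds } {
        # a fresh answer is sitting in the server's virtual memory
        return $memo_cache($tcl_call)
    }
    # compute the expensive result, then update the cache for casual users
    set result [eval $tcl_call]
    set memo_cache($tcl_call) $result
    set memo_cache_time($tcl_call) $now
    return $result
}

The Wealth Clock page can then serve [memoized_eval wealth_clock_html 600] to casual users and call wealth_clock_html directly for anyone who clicks the extra "real-time" link.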

Does this sound like overengineering? It didn't seem that way when Netscape put the Wealth Clock on their What's New page for two weeks (summer 1996). The URL was getting two hits per second. Per second. And all of those users got an instant response. The extra load on my Web server was not noticeable. Meanwhile, all the other sites on Netscape's list were unusably slow. Popularity had killed them.


Example 6: AOLserver Dynamic Pages

As long as we're on the subject of Bill Gates, it is worth demonstrating the syntax and style that his company inspired with its Active Server Pages. The folks at America Online fell in love with this idea but not with the reliability of NT or IIS. Thus they added a similar facility to AOLserver called AOLserver Dynamic Pages (ADP), which I used to build the WimpyPoint system, described in Chapter 1.

Consider the WimpyPoint page that offers public presentations to casual surfers. The idea is that someone will come to the site, look for the name of the author, then click down to find the presentation of interest.

Here's the ADP source code:

<% wimpy_header "Choose Author" %>

<h2>Choose an Author</h2>

in <a href="/"><%=[wimpy_system_name]%></a>

<hr>

Here's a list of users who have public presentations:

<ul>

<%

set db [ns_db gethandle]
set selection [ns_db select $db "select distinct u.user_id, u.last_name, u.first_names,  u.email
from wimpy_users u, wimpy_presentation_ownership wpo, wimpy_presentations wp
where u.user_id = wpo.user_id
and wpo.presentation_id = wp.presentation_id
and wp.public_p = 't'
order by upper(u.last_name), upper(u.first_names)"]

while { [ns_db getrow $db $selection] } {
    set_variables_after_query
    ns_puts "<li><a href=\"user-top.adp?user_id=$user_id\">$last_name, $first_names ($email)</a>\n"
}

%>

</ul>

Or you can do a full-text search through all the slides:

<form method=GET action="search.adp"> 
Query String:  <input type=text name=query_string size=50>
<input type=submit value="Submit">
</form>

<% wimpy_footer %>
Note that I'm allowed to use arbitrary HTML, including string quotes, at the top level of the file. Note further that there are two escapes to the ADP evaluator. The basic escape is <%, which will execute a bunch of Tcl code for effect. If the Tcl code wants to write some bytes to the browser, it has to call ns_puts. The second escape sequence is <%=, which will execute a bunch of Tcl code and then write the result out to the browser. Generally I use the <%= style when I want to do something simple, e.g., include the system name that I grab from the Tcl procedure wimpy_system_name. I use the <% style when I want to execute a sequence of Tcl procedures to query the database, etc.

Example 7: Active Server Pages

I haven't personally written any Microsoft Active Server Pages. Fortunately, Microsoft set up NT/IIS/ASP such that if you were curious to see the source code behind http://foobar.com/yow.asp, you had only to type "http://foobar.com/yow.asp." (note the trailing period) into your Netscape and the foreign server would deliver the source code right to your desktop. This was a great convenience for people trying to learn ASP; however, it presented something of a security problem for Web publishers, because they would often have their database or system administration passwords in the source code. It seems that Microsoft's intention was not to make public all of its customers' source code and hence they eventually released a security patch to change this behavior. However, a few months later people learned that requesting "http://foobar.com/yow.asp::$DATA" (note the trailing "::$DATA") would also get them the source code.

Anyway, thanks to Microsoft's sloppiness, in just a couple of hours of surfing one night in July 1998, I managed to accumulate a nice collection of ASP examples at http://arsdigita.com/books/panda/aspharvest/. Note that I did my surfing some time after the bug had become common knowledge yet companies such as DIGITAL, Arthur Andersen, and banks had not patched their servers.

I find firewall.asp amusing because it is DIGITAL's advertisement for their network security products. Similarly I like the fact that GAP Instrument Corp. took the trouble to warn users

You have reached a computer system providing United States government information. Unauthorized access is prohibited by Public Law 99-474, (The Computer Fraud and Abuse Act of 1986) and can result in administrative, disciplinary or criminal proceedings.
(the very first link from http://net.gap.net and all the other pages on their Web sites) yet had left their ASP pages wide open.

CompuServe gives us a nice simple example with Conf.asp. The goal of the script is to first figure out whether the person browsing is a CompuServe member or not and then serve one of two entirely separate HTML pages. An if statement is thus opened inside one <% %> and closed in another:

<!--#INCLUDE VIRTUAL="/Forums/member.inc"-->
<% if member = 1 then %>
<HTML>
<HEAD>
<TITLE>TW Crime Forum</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>

... ** a page for members *** ..

</BODY>
</HTML>

<BR><I>We Update the Forum Directory Weekly.  The directory was last updated: Thursday, January 08, 1998</I>
...
</BODY>
</HTML>

<% else %>
<HTML>
<HEAD>
<TITLE>TW Crime Forum</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>

... ** a page for non-members ***

</BODY>
</HTML>
<%End If%>
An interesting thing to note about this page is that CompuServe hasn't run their HTML through a syntax checker, which would no doubt have complained about the extraneous text after the first </HTML> (the forum-directory update line and the duplicated closing tags above).

Let's move on to some db-backed pages.

The folks who built Fulton Bank's site (www.fulton.com) are very enthusiastic about Microsoft:

"The hottest technology to hit the Internet which is actually useable now is Active Server Page scripting. This has given us a number of advantages over the ancient art of CGI. ... Intranets and Extranets where the variety of user machine platforms, processors, etc are an issue ASP can play in nicely."
-- http://coolnew.xspot.com/what_we_use.asp
Let's see how ASP works for them in process_product.asp, a script that takes a query string and tries to find banking products that match this query string.
<% affcode = 1057 %>

<HTML>
<HEAD>
<TITLE>Fulton Bank</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF">

<BLOCKQUOTE>
<TABLE WIDTH=370 ALIGN="middle">
<TR>
<TD>
<BR>
<IMG SRC="images/header_products.gif"><BR>
<BR>
<BR>

<% 
   Set Conn=Server.CreateObject("ADODB.Connection")
   Conn.Open "FultonAffiliates"

   SQL = "SELECT * 
FROM products 
WHERE productname 
LIKE '%" & Request.Form("product") & "%' 
AND affiliate = '" & affcode & "'"

   Set RS = Conn.Execute(SQL)
%>

<TABLE>

<% if RS.EOF then %>
<TR><TD>Sorry No Products Found</TD></TR>
<% end if %>

<% DO UNTIL RS.EOF %>
<TR>
<TD VALIGN="top"><IMG SRC="images/diamond3.gif"></TD>
<TD>
<A HREF="<% = RS("url") %>"><FONT COLOR="blue"><% = RS("productname") %></FONT></A><BR>
<% = RS("shortdesc") %><BR>
<BR> <BR>
</TD>
</TR>
<% RS.MoveNext %>
<% LOOP %>
</TABLE></BLOCKQUOTE>
</TD>
</TR>
</TABLE>
<% rs.close
   conn.close
%>
<!--#include file="footer.asp"-->
</BODY>
</HTML>
This is some pretty clean code. The programmers have encapsulated the database password in their ODBC connection configuration. Also, rather than just bury the magic number "1057" in the code, they set affcode to it as the very first line of the program. Finally, they've parked the page footer in a centralized footer.asp file that gets included by all of their scripts.

What you should have learned from this section is that, if you're going to use Microsoft server tools, you shouldn't take any programming shortcuts, leave the database or Administrator password in the code, or put any naughty words into comments. When the next NT/IIS/ASP bug is discovered and your source code becomes public, you want people to admire your work!

If you're thinking that ASP sounds like a better-than-average idea from Microsoft, you won't be surprised to learn that it wasn't their idea. They dipped into some of their desktop monopoly profits to acquire the small company that developed ASP. As I wrote this, I tried to surf over to http://www.microsoft.com/iis/ to see if they credit the programmers who developed ASP, but the Microsoft server farm was taking 45 seconds to deliver each page. So I gave up.

Summary

Server-side programming is straightforward and can be done in almost any computer language, including extended versions of HTML itself. However, making the wrong technology decisions can leave you with a site that requires ten times as much computer hardware to run. Bad programming can also result in a site that becomes unusable just as you've gotten precious publicity. Finally, the most expensive asset you are developing on your Web server is content. It is worth thinking about whether your server-side programming language helps you get the most out of your investment in content.

More



or move on to Chapter 11: Sites that are really databases

philg@mit.edu

Reader's Comments

Please take some time to investigate and properly flesh out this section with some information about Java server-side programming. I find the Servlet API an absolute MUST in my work. When combined with a top-notch JVM (a la IBM), on a proper foundation such as Linux with the Apache web server and servlet engine, it has proven to bring me completely out of the dark ages of thick-client GUI programming.

The network-centric world is here and I would venture that one could not find a more network-savvy programming language. There are careful choices to be made upon entering the Java arena, but they are easy, obvious choices and the benefits of making the commitment are simply joyous.

-- Mitch Winkle, November 3, 1999

You can avoid having to escape double-quote marks with backslashes by using single-quote marks instead. This is legal HTML as documented in a page at the W3C titled On SGML and HTML where it says:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa.

I have used this technique many times over the past five years or so and I have never seen a failure. Pages done this way also pass the tests at validator.w3.org so I am pretty comfortable with it.



-- Peter Holt Hoffman, December 23, 1999
One of the most popular server-side languages is PHP, with over one million web servers supporting it as of November 1999. Its home page is at http://www.php.net

-- Marc Delisle, February 18, 2000
Please, please, please don't recommend The CGI/Perl Cookbook or anything else that might even allege that the code in Matt's Script Archive might even possibly be the right way to do things.

I worked for a medium-sized ISP for a number of years. Part of my job there was providing support for users who were attempting to integrate dynamic content with their personal websites. Most of them were using pre-packaged CGI, much of which was by either Matt Wright or Selena Sol. These scripts were a support nightmare, and in some cases (Matt's search.pl) brought our poor webserver to its knees. Some of the more interesting bits include calling 'grep' from within a Perl script (WTF??), parsing GET/POST data manually (instead of using the CGI library that's been included with the standard Perl distribution for a couple of years now), and doing keyword searches "The Hard Way" (i.e., opening up a file in the web directory, slurping the whole thing into memory, applying regexps to it, and moving on to the next file. This for each and every query!)

Note that we provided our users with a competent, indexing full-text search engine, and even went so far as to write our own when that one wouldn't scale. Still, our users persisted in installing this evil onto the webserver. For a while, every few weeks, we were hunting-and-killing instances of Matt's search.pl installed by our users. It was awful.

For a similar perspective, just check out what happens when anyone mentions the name "Matt Wright" anywhere in comp.lang.perl.*. Eek.

Now, I have to admit that I haven't read The CGI/Perl Cookbook. It's possible that Mr. Wright has learned a bit about what good code looks like, but if that's the case, the contents of Matt's Script Archive don't seem to reflect it. It was really nice of him to try to share his knowledge and code with the newbies of the world, but it's possible that he's inadvertently caused more harm than good.

For a beginning Perl programmer, Learning Perl, from O'Reilly & Assoc. (the Llama Book) is probably still the best thing out there. I haven't read any of the CGI specific Perl books, but there has to be something better.

This is such a wonderful, informative book... Please don't lead your readers astray! :)

-- Ian Baker, March 15, 2000

I've been using JavaServer Pages (JSP) for a while now and think that they definitely merit consideration.

-- George Harley, March 16, 2000
As an addendum to Example 7... a new Microsoft bug takes over where the other two left off. You can view the source of ASP pages on some servers again.
   > SECURITY LEAD STORY:
   > IS WEBHITS.DLL REVEALING YOUR SOURCE?
   > 
   > Imagine the URL of a typical ASP site:
   > http://www.yoursite.com/yourfile.asp
   > 
   > Now try this variation:
   > http://www.yoursite.com/null.htw?CiWebHitsFile=/yourfile.asp%20&CiRestriction=none&CiHiliteType=Full
   > 
   > If you see your source code, you have the webhits.dll bug!
   > 
   > Microsoft's Fix is at:
   > http://www.microsoft.com/technet/security/bulletin/ms00-006.asp
You can use that information to learn ASP by example, trash Microsoft products, help a friend whose server is on NT, or view funny and/or disparaging comments by sloppy consultants. Good luck.

-- Rob Duarte, April 3, 2000
The string handling capabilities of Perl are much superior to Tcl's. Tcl doesn't even have the concept of a "here document".

It is silly to compare the time it takes to start an application like MS Word on a PC with the time it takes a UNIX system to fork and run a CGI script. The time it takes for a UNIX system to fork and run a CGI script is so small that it is not perceivable by a human being. Every time you type "ls" on the command line on a UNIX system, the shell forks and execs the "ls" program. UNIX systems are optimized for doing this and for running several different interpreter programs like Perl simultaneously.

I am tired of hearing this argument against CGI programming. UNIX systems start new programs very quickly. CGI programs are not as large and slow to start as PC applications. It's not a good reason not to do CGI programming.

In addition, most people who use this argument against CGI also provide CGI interfaces to run their products. Meta-HTML uses CGI to interface to web servers.

Having to open a new database connection on every hit is a good reason not to do CGI programming.

Also, with traditional CGI programming you are embedding HTML in a program, which is bad because then you have programmers doing design (HTML) work. Most programmers have no clue about graphic arts. You have to use some method like JSP, ASP, or templates so that a real graphic artist can do the design work and your programmers can put the code into the HTML, because that is something a programmer can handle. A graphic artist is not likely to understand how to put HTML into a program. Having a programmer do the graphic design work leads to really bland, visually boring sites like this one.



-- Bill Chatfield, May 3, 2000
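The template separation Bill Chatfield asks for needn't involve heavy machinery. A minimal sketch in classic ASP, assuming a designer-owned file named page-template.html (a made-up name) containing a %%PRODUCT_ROWS%% placeholder, and a rowsHtml string already computed by the programmer:

<%
   ' Read the designer's HTML template and substitute the computed fragment.
   ' rowsHtml is assumed to hold the HTML built earlier in the script.
   Set fso = Server.CreateObject("Scripting.FileSystemObject")
   Set f = fso.OpenTextFile(Server.MapPath("page-template.html"), 1)   ' 1 = ForReading
   html = f.ReadAll
   f.Close
   Response.Write Replace(html, "%%PRODUCT_ROWS%%", rowsHtml)
%>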

Peter Holt Hoffman wrote:

You can avoid having to escape double-quote marks with backslashes by using single-quote marks instead ...

Not quite true. In some cases you *have* to use double-quotes. Consider the following piece of code:

<input type='hidden' name='Last_Name' value='<%=$Last_Name%>'>

Resulting HTML is perfectly valid as long as the Last_Name variable contains a string like "foobar". As soon as you put "D'Andrea" into the Last_Name variable, you get:

<input type='hidden' name='Last_Name' value='D'Andrea'>

This problem can be simply resolved by using double quotes for the value:

<input type='hidden' name='Last_Name' value="<%=$Last_Name%>">

Again, you have to be careful if the string contains a double-quote character.



-- Nemanja Stanarevic, July 25, 2000
You'd be better off saying this:
<input type='hidden' name='Last_Name'
  value='<%=[ns_quotehtml $Last_Name]%>'>
Then you don't have to worry about single or double quotes in $Last_Name, and you can use single quotes in your HTML.

-- Rob Mayoff, July 25, 2000
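ASP pages can take the same belt-and-suspenders approach. Server.HTMLEncode covers most cases; a hand-rolled helper for double-quoted attribute values might look like this (a sketch; QuoteAttr is a made-up name):

<%
Function QuoteAttr(s)
   ' Escape the two characters that can break out of a
   ' double-quoted HTML attribute value.
   QuoteAttr = Replace(Replace(s, "&", "&amp;"), """", "&quot;")
End Function
%>
<input type="hidden" name="Last_Name" value="<% = QuoteAttr(Last_Name) %>">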

In step 4, example 1 you talk about 302 redirects. These are, according to RFC 2616 (HTTP/1.1), only temporary redirections. Wouldn't it have been better to use a 301 redirection, which is "permanent"?

Unless I'm mistaken, search engines will update their links when they encounter a 301 redirect, whereas 302 redirects do not result in such an update. Does anyone know for sure?



-- Tomi Junnila, September 19, 2000
Hi, Phil, I'm here at ArsDigita Bootcamp, and the exercise I am doing right now is based on the code on this page for Bill Gates Wealth Clock. In order for this code to work now, because of changes on your server and on the population server you reference, the regexp on line 39 should call for <H2>s instead of <H1>s, and the pretty_date code should come out. Just so each person doesn't have to debug it when they do the exercise. Thanks for everything!

-- Sunah Cherwin, January 16, 2001
" ... my unix box doesn't like to fork 500,000 times a day ..."

That's a fork every .172 seconds (86,400 seconds in a day divided by 500,000). A 166 MHz PC forks slightly under 1,000 times per second, or one every .001 seconds.

-- Evan Schaffer, March 6, 2001

Well in short: THERE ARE SOME UNTRUE STATEMENTS ABOUT ASP.

-- Aurelian POPA, May 13, 2001
Doesn't Meta-HTML resemble Lisp?

-- Andrei Popov, June 15, 2001
Yes, it should be 301 Redirect ("moved permanently"), not 302 ("moved temporarily").

Search engines do follow this convention. On my site, I set up a 302 redirect to somebody else's site from a made-up URL that it never occupied, just to illustrate how redirects work. I also listed this made-up URL as a hyperlink on one of my pages. After a while, a search for the title of that site on Google would return my URL as the first hit, accompanied by title and content from the site that my URL was redirecting to. There was no separate search result for the site's own URL.

I changed the redirect code from 302 to 301. Google apparently re-visited my URL several weeks later, and now a search for the title of that site turns up the URL of that site, as it should. My redirected URL is no longer visible anywhere in the search results.

I'm not sure if you can "hijack" the top listing from somebody this way. It's likely that the site I wrote about was not indexed by Google before (it had a brand new domain name) and was found first through my redirect.

-- Vadim Makarov, August 10, 2001
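For reference, classic ASP's Response.Redirect sends a 302; to send the permanent variant you set the status line and Location header by hand (a sketch; the target URL is made up):

<%
   Response.Status = "301 Moved Permanently"
   Response.AddHeader "Location", "http://www.example.com/new-home/"
   Response.End
%>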

This page is slanted toward publishing sites, and it's really showing its age to boot.

Back in early 1997 my colleagues and I were looking for something to replace Perl CGIs for publishing/community sites (kinda like photo.net), doing some of the same analysis that Phil has done here. Perl is a fun language and quite capable at system scripting and complex text processing, and I still use it today for various things. However, it's really hard to maintain a medium (>5K lines of Perl code) or large code base written in Perl, and CGI is dreadfully slow. The clever Apache module "mod_perl" that keeps Perl in-process (avoiding forking) either wasn't available or didn't work, FastCGI broke too often, and so we looked at other languages. ASP was very easy to learn but it was Microsoft-only (at the time) and VBScript is a really weak language even for scripting languages. We used Visual J++ to create COM objects written in Java which did the heavy lifting and used JScript in ASP to do the page stuff. That worked OK except for the usual NT/IIS stability problems. We used that for a few sites.

We also evaluated Netscape Livewire, which is compiled JavaScript in HTML pages, with some built-in objects written in C such as database connectors with connection pooling, local filesystem utility objects, an SMTP mail connector, etc. It was horrible - it was fast, but extremely buggy, both at a page programming level (stuff crashed or didn't work) and at a server level (configuration changes didn't stick or didn't work, the server would just hang on restart sometimes).

We didn't try AOLserver because we decided Tcl was too weak a scripting language and had doubts about the product's future. Perhaps we should have tried it, but we didn't need to because...

We tried Java Servlets very early on and found that they worked, although the runtime "servlet containers" were extremely immature, and database drivers were hard to find. Compilation wasn't a big deal in those days but it was definitely slower than edit-save-reload. On the other hand, performance, and more importantly scalability, was fantastic compared to CGI, since servlet containers are multithreaded and since we wrote a simple database connection pool early on. We were also starting to build sites that had light e-commerce functionality, so we needed a language that would allow us to write fairly complicated code without it getting out of hand.

Java really paid off in this respect, in a way that easy scripting languages won't. Even if a scripting language has the ability to let you talk to components/objects written in a "real programming language", that can be a major pain in the rear for the "real programmer" if the scripting environment doesn't handle the data type mapping between the two languages. With Java this was not an issue but we did still have to deal with the HTML-in-source-code issue. I dealt with this the same way I did with Perl - by implementing a trivial template system which used fake tag substitution. This worked OK but restarting the servlet container to show code changes was still a problem (templates reloaded automatically). After far too long, and after too many proprietary competing technologies had gotten a foothold, Sun released the JavaServer Pages specification, based on ASP. In the JSP architecture there was still a compilation step but you didn't see it because the JSP/servlet container did it for you the first time you reloaded the page after changing things. You still have to restart the servlet container to see changes but at least the templates are standardized and it's somebody else's problem to code and debug the template system.

Maybe this all could have been done in AOLserver but we never went down that road because Java worked well for us. Unlike all the other stuff we had tried, Java worked the way we expected it to, and didn't bog down terribly or crash when we wrote a load-testing tool and aimed it at our web sites. So we stopped looking.

I've been working on a small, silly web site project that is mainly an application (as opposed to a document) but isn't terribly complex, and which talks to a database, and I've been using PHP to prototype it. The site has a WAP interface (it makes sense, I promise) and figuring out WML using a scripting language has made it a much less painful experience as I've struggled with getting the WML code just right so that the minibrowser won't barf. I'd hate to think what it would have been like with a Real Programming Language, although JSP wouldn't have been too bad since there's no (visible) compile cycle involved. It took me a few hours to figure out the right command-line incantation to get Apache and PHP to compile correctly and to link to the native Oracle client library, but that's part of the joy of using open-source C programs: you have to read the documentation and fiddle a bit until you get it right. Still, there are a lot of useful text and HTML functions; the database access is pretty snappy, and this may be the appropriate heir to the niche that AOLserver flourishes in. I'll probably do 50% of the site in PHP and then write the tough parts in Java, then decide whether I should replace the PHP stuff with Java or just leave it as a hybrid.

One other thing about Java, which applies to some other languages but not to most scripting languages: its error handling (via a language feature called Exceptions) is fantastic. This is one place where most scripting languages fall down, although I should note that some of them have grafted it on as an afterthought. C doesn't even have exceptions, although C++ does. Exceptions are basically a way to signal an error condition by stopping execution of the current block of code and exiting with a value that isn't necessarily of the same datatype as the expected return value, but is an object that may contain info about the error. Java has had exceptions from the beginning, and the Java class library uses exceptions all over the place, so it's part of the zen of Java that you use exceptions. Perl has exceptions but I've seen a lot of Perl code (from CPAN, from various commercial software vendors, etc.) and I've never seen them used. If you lack exceptions then you have to write ugly functions that are actually procedures which may return an error code, with "return arguments" (some of the things you supply as parameters are actually placeholders for return values). That means you end up with calling code that looks like:

err = do_stuff(a, &b);         // a is input, b is output
if (NO_ERROR == err) {         // did do_stuff work?
   err = do_more_stuff(b, &c);  // b is input, c is output
   if (NO_ERROR == err) {
      printf("yay, c is %s", c);
   }
   else {
      printf("do_more_stuff failed with an error code of %d", err);
   }
}
else {
   printf("do_stuff failed with an error code of %d", err);
}
which is a royal pain, so lazy programmers tend to just skip thorough error checking in code and let the QA people find the error conditions. With exceptions the above code can look like:
try {
   do_more_stuff(do_stuff(a));
}
catch (int err) { 
   printf("there was an error: %d", err");
}
which is a lot cleaner IMHO. Apply this to a mission-critical app that handles money, order data, etc. (in which all errors must be caught and handled appropriately) that is tens or hundreds of thousands of lines long, and you can see why exceptions are important.

As for Phil's assertions about application servers, I disagree. Application servers have some very advanced functionality that makes sense for very complex back-end applications. A publishing/community site like photo.net doesn't need that stuff; nor does slashdot.org or f**kedcompany.com. That's why these sites just use a scripting language and an RDBMS. Compare that to E*Trade or Orbitz, which are basically big complicated back end systems with a web UI. In these cases the business logic is very complex, and transactions may need to take place across a half dozen systems to process a request. That's why these systems use Java and an application server. However, chances are, almost nobody reading this is building something that big, so chances are you don't need an application server. A JSP/Servlet container is probably fine, and there are several excellent free open source ones (Resin and Tomcat come to mind). Database connection pooling is a must, and either your database driver should include it (if it's modern enough) or you can steal one or code it up yourself in a day.

I also recommend if you're dealing with a lot of forms and complicated DB tables that you look into TopLink. If you're building a complex app, chances are you're not just shuffling strings from forms to SQL statements and then back from query results into HTML. You probably have objects that represent real-world entities in your application (user, customer, product, payment, rating, comment, shipment, etc.) and those don't map exactly to your data model because objects and tables are inherently different representations of the entity. TopLink is a very slick tool/library combo that lets you declaratively define these mappings in a GUI, but it also caches objects (reducing DB access), allows you to write queries for objects and their properties rather than writing SQL select statements, and allows object transactions that can affect multiple databases. It extends its in-memory object transactions and locking down into the underlying DB's locking mechanism, meaning that not everything touching your database has to use TopLink and things will still work the way you expect. We were looking for a simple object-relational mapper and were blown away by how sophisticated it is. It is commercial; I think it was $5K per developer seat but there was no charge for the runtime library. It was acquired by WebGain and their online store doesn't have a price so I don't know how much it costs now.

One last note about application servers - from what I've read, nobody likes EJB entity beans, but this doesn't mean J2EE is crap, or that EJB session beans are crap. It just means that EJB entity beans are probably a really dumb way of loading and storing your persistent objects in an RDBMS. Either roll your own (objects that have a "save me" / "load me" method, or a "factory" class that knows how to do queries by object property and return a bunch of matching objects, encapsulating the SQL inside itself), or check out TopLink.

By the way, this is such a long comment that my session timed out, and the first time I tried to submit it, I got a response page containing some SQL and an Oracle error describing a parent key violation because my userid was zero. Oops. This is why catching errors is important... even on community sites. :)

-- Jamie Flournoy, August 21, 2001
