The Gecko NFS Web Proxy

Scott Baker and John H. Hartman

Department of Computer Science
The University of Arizona
Tucson, AZ 85721

Abstract

The World-Wide Web provides remote access to pages using its own naming scheme (URLs), transfer protocol (HTTP), and caching algorithms. Not only do these special-purpose mechanisms have performance implications, but they also make it impossible for standard Unix applications to access the Web. Gecko is a system that provides access to the Web via the NFS protocol. URLs are mapped to Unix file names, giving unmodified applications access to Web pages; pages are transferred from the Gecko server to the clients using NFS instead of HTTP, significantly improving performance; and NFS's cache consistency mechanism ensures that all clients have the same version of a page. Applications access pages as they would Unix files. A client-side proxy translates HTTP requests into file accesses, allowing existing Web applications to use Gecko. Experiments performed on our prototype show that Gecko is able to provide this additional functionality at a performance level that exceeds that of HTTP.

Keywords: NFS, HTTP, Proxy, File System, Hyperlink

1.0 Introduction

The World-Wide Web suffers from performance and interface problems, many of which can be traced to its use of the Hypertext Transfer Protocol (HTTP) to transfer pages between machines. HTTP [5] was originally a very simple protocol intended for use by browsers to access Hypertext Markup Language (HTML) [2] pages stored on Web servers. It was text-based, used a very simple Uniform Resource Locator (URL) naming scheme that exposed the underlying file system on the Web server, transferred only entire pages, and mixed control information and data in the same transfer.

As the popularity of the Web grew, so did the demands on HTTP, causing it to evolve in new and unanticipated directions. The result is a complex protocol that handles proxy servers, redirection, authentication, naming, partial page transfers, multiple page transfers, etc. The protocol is inefficient, difficult to implement, and difficult to use. Application programmers who wish to access the Web must contend with HTTP, instead of using the more familiar read and write system calls used to access files.

The Gecko project solves these problems by layering file system functionality on top of HTTP. The goal is to make pages stored on the Web accessible in the same fashion as files stored in a file system, using the same convenient directory-based naming scheme and standard I/O routines such as read, write, and mmap. By making the Web appear to be a file system, Gecko leverages existing file system functionality. Web pages are cached inside the operating system just like files in a network file system, and the issues of cache consistency, replacement, and size are all handled by the operating system. In contrast, HTTP requires that browsers and other application programs do their own caching, perhaps at odds with one another.

The result is a seamless integration of the Web into the operating system. Applications name and access pages on the Web just as they would files, eliminating the need to write custom application code to access the Web. Applications accessing the Web can take full advantage of operating system features such as client-side data caching, cache sharing between applications, server-side data caching, coherency between client caches, and sharing of server-cached pages. Examples of applications that can benefit from Gecko include standard Unix utilities such as grep and find, as well as Web browsers that access the Web through a client-side proxy.

Gecko is implemented as a Web proxy that provides access to the Web via the Network File System (NFS) [10] protocol, rather than HTTP. The Gecko server uses HTTP to retrieve pages from the Web, but makes those pages accessible to its clients via NFS (Figure 1). Any standard NFS client, such as a Unix workstation, may mount the Gecko server and gain full access to the Web. All of the NFS functionality built into the client, such as file caching, file name caching, cache consistency, etc., operate on Web pages just as they would on NFS files. Clients share pages cached on the Gecko proxy server, and access them using the NFS protocol, instead of the much slower HTTP protocol. In addition, NFS maintains the consistency of pages cached on clients, so that all clients see new versions of pages read by any client.

Performance tests show that NFS clearly outperforms HTTP on a local area network. NFS can transfer an 8KB page between the server and client in 3.5ms, whereas HTTP requires 20ms. The NFS protocol also adds several implicit benefits such as automatic client-side caching and cache consistency. The same 8KB transfer from the client file cache requires only 0.3ms. Gecko not only provides a familiar interface to the Web, but improves its performance as well.

2.0 Unix vs. Web Names

Providing access to the Web via NFS requires that Web pages have names in the NFS name space. NFS uses the standard Unix directory hierarchy to name files. A file name consists of component names separated by ‘/’ characters, each component specifying a name mapping within a particular directory. For example, the name /a/b specifies the file or directory b within the directory a, within the root directory ‘/’. A file or directory may have many names, but each name refers to a single file or directory.

Naming Web pages is more complicated. A Web page is accessed via its URL, also a sequence of names separated by ‘/’ characters. Although its format is similar to a Unix file name, it does not imply a hierarchical directory structure. For example, the URL http://foo.com/A/index.html refers to the page A/index.html on the server foo.com. Most Web servers have adopted a scheme that maps the URL to a Unix file name directly. In this case the page A/index.html would be stored in the file A/index.html relative to the root of the Web server’s file system.

This organization leads to the first Web naming scheme, the URL hierarchy. Because URLs look like Unix file names, an implicit hierarchy can be formed from the URL namespace. For example, the URL http://foo.com/A/index.html can be thought of as an entry in http://foo.com/A/. Although it is tempting to map the URL hierarchy directly to the Unix directory hierarchy in this fashion (e.g. map the URL http://foo.com/A/index.html to the Unix file name /foo.com/A/index.html), there are several complications. First, unlike the Unix directory hierarchy, the URL hierarchy may have disconnected components. Although http://foo.com/A/index.html is a valid URL, http://foo.com/A/ may not be; an attempt to access the page at the latter URL will fail because the page doesn’t exist.

Second, even if http://foo.com/A/ exists it may not be possible to determine its children in the URL hierarchy. A Unix directory is a special type of file that contains the names of other files and directories (its children), forming the directory hierarchy. Reading a directory provides a list of its entries. The same is not true for a Web page. The HTML contents of a page may bear little or no relationship to its children in the URL hierarchy. It is possible (and common) for http://foo.com/A/index.html to exist even if the page at http://foo.com/A/ doesn’t have a reference to it.

Many Web pages are stored in HTML format, allowing pages to be named by the HTML link graph. HTML pages contain references to the URLs of related pages, embedded images, frames, etc. There need be no relationship between a page’s URL and the URLs of the links it contains; they may be in different directories on the same server or on different servers altogether. Thus any page can have a link to any other page, creating a directed graph that may contain cycles.

To complicate matters further, neither a page’s URL nor its location in the link graph uniquely identifies its contents. When a client requests a page it specifies a set of request headers that may be used by the server to tailor the contents of the page. The headers are ASCII strings that include such information as the client’s preferred language (e.g. English), browser type and version, authentication information, etc. The Web server can use this information to modify the page contents; for example, a URL requested with a Spanish language preference may yield different results from the same URL requested with an English preference. The request headers are an integral part of naming, as changing the headers changes the resulting page. Since an arbitrary number of headers may be specified, and each header may be of an arbitrary length, an arbitrary length string is required to identify a Web page uniquely.

Figure 1: Gecko Architecture. Gecko acts as an intermediary between the complex semantics of HTTP and the standardized access methods of NFS.

There are several complications with mapping the Web’s dual naming schemes to Unix file names. First, a user may specify a page either through the link graph, by following a link within another page, or by giving its URL directly. There may be many links to the same URL, and many URLs for the same page. Second, although every page must have a URL, not every page is accessible from the link graph, i.e. some pages may have no links that point to them. This means that both the link graph and the URL hierarchy must be encoded in the Unix naming scheme. Last, request headers enable a single URL to refer to multiple pages.

2.1 Mapping Web Names to Unix Names

Gecko’s naming scheme is based on mapping a URL directly to a Unix file name by treating each component of the URL as a directory. The URL http://foo.com/A/index.html is converted into the Unix file name /foo.com/A/index.html. http://foo.com/A/ is implemented as the Unix directory /foo.com/A, with an entry for index.html. Every Web page is represented by a Unix directory whose contents reflect the structure of the page. Included in the directory is a .contents file that contains the HTML contents of the page, and a .headers file that contains the response headers returned with the page. Each hyperlink in the page is represented by a symbolic link whose name begins with the prefix .link and whose target is the directory of the linked page. For example, /foo.com/A/index.html is a Unix directory representing the page at URL http://foo.com/A/index.html. The file /foo.com/A/index.html/.contents contains the HTML contents, and /foo.com/A/index.html/.headers contains the response headers. If the page contains a reference to http://bar.com/misc.html then the directory /foo.com/A/index.html will contain a symbolic link to ../../../bar.com/misc.html. This arrangement allows both the URL hierarchy and the link graph to be traversed in the Unix name space.
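The mapping from a URL to a Unix pathname is a simple string transformation. The following is a minimal sketch of that transformation, assuming the Gecko file system is mounted at /web (the mount point used in Section 2.2); it is illustrative only and not the server’s actual code.

#include <stdio.h>
#include <string.h>

/* "http://foo.com/A/index.html" -> "/web/foo.com/A/index.html/.contents" */
int url_to_contents_path(const char *url, char *out, size_t outlen)
{
    const char *prefix = "http://";
    if (strncmp(url, prefix, strlen(prefix)) != 0)
        return -1;
    /* Drop the scheme; the host name becomes the first directory component. */
    int n = snprintf(out, outlen, "/web/%s/.contents", url + strlen(prefix));
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}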

There are several ways of integrating request headers into this naming scheme. A simple solution is to embed them in the Unix name for the page, by concatenating them with the page’s URL. For example,

HTTP Request:
    http://www.myhost.com/foo/bar.html
    Language: English
    User-Agent: Mozilla/3.02

Unix Request:
    /web/language_english/user-agent_mozilla_3.02/www.myhost.com/foo/bar.html

The downside of this approach is that the Unix pathname grows very large and unwieldy. Most Web requests have at least a half-dozen headers, and with advanced HTTP features such as cookies, the total length of the headers may exceed 1KB, the maximum length of a pathname on many Unix systems. Even if the headers were to fit in the pathname, standard Unix applications and users would have to contend with long and complex pathnames.

Another possibility is to associate implicitly the request headers with the desired URL. In this solution, the Gecko server maintains a set of headers to send to Web servers along with the requested URLs. Clients use an out-of-band communication channel to modify the headers on the server. As long as the headers are set up appropriately, this method allows applications and users to access pages by specifying only the URL. An example is:

HTTP Request:
    http://www.myhost.com/foo/bar.html
    Language: English
    User-Agent: Mozilla/3.02

Unix Request:
    set "Language: English" via alternate channel
    set "User-Agent: Mozilla/3.02" via alternate channel
    use pathname /web/www.myhost.com/foo/bar.html

This approach solves the problem of pathname complexity by reducing the pathname to the URL, which is manageable both by users and by existing applications. However, it does not support concurrent accesses. If, for example, two different users wish to use different sets of headers to access the Web, their modifications to the set of headers may conflict and lead to unpredictable results. The only way to avoid this conflict is to serialize requests by multiple users, which would prohibit multitasking and impose a significant performance penalty.

2.1.1 Header Sets

In Gecko we use a hybrid approach based on both an extension to the file system name space and implicit headers. The Gecko server maintains a collection of header sets, each containing one or more request headers. Clients can create and delete header sets, as well as modify their contents. Each header set has a unique identifier, and the Unix name space is extended by adding this identifier to the beginning of the Unix file name. This allows the user or application to specify which header set should be used when fetching a particular URL, and avoids the synchronization problems of a single header set. For example,

HTTP Request:
    http://www.myhost.com/foo/bar.html
    Language: English
    User-Agent: Mozilla/3.02

Unix Request:
    put "Language: English" into header set #1234
    put "User-Agent: Mozilla/3.02" into header set #1234
    use pathname /web/#1234/www.myhost.com/foo/bar.html

This is a compromise between long file names and simplicity. Each client can maintain its own header set on the server, avoiding race conditions between clients. Clients can share header sets as long as they synchronize between themselves, and the server can share the underlying storage among header sets with the same contents. Efficiency is improved since each client process can manipulate its own header set independently, and only needs to modify the headers as necessary. Many headers, such as the browser type, may never change, while others change slowly. Each client can update its own headers as appropriate, without interacting with other clients. A default header set is available for applications that do not wish to use header sets.

2.1.2 Privacy Concerns

As the Web is used for sensitive commercial and government work, security and privacy measures are necessary. It is necessary not only to prevent one user from viewing private data requested by another, but also to conceal the fact that such data ever passed through Gecko.

In an unprotected Gecko system, any user could traverse the directory hierarchy. Issuing the Unix ls command could leak information about the names and URLs that have passed through Gecko, and issuing the Unix cat command could leak the data pages themselves. Gecko uses an ownership system to protect branches of the directory hierarchy. Any pages retrieved with the default header set are automatically public and may be shared amongst all users. This sharing improves performance, and we encourage most clients to use the default header set whenever possible. Pages that are retrieved using a private header set are restricted to the user who owns the header set.

Ownership is communicated through standard NFS mechanisms that are quite similar to Unix user-id and group-id permission bits. Any mechanism that supports or enhances NFS security also applies to Gecko security since the protocol support is identical.

2.2 An Example

The following example illustrates how the Web is accessed through Gecko. In this example the user has mounted the Gecko file system as the directory /web. The user begins by using the cd command to change into the directory that represents the University of Arizona home page:

>cd /web/www.arizona.edu
>ls -ao
total 31
-rw-rw-rw- 2 nobody 3550 Oct 15 17:15 .contents
-rw-rw-rw- 2 nobody 257 Oct 15 17:15 .headers
lrwxrwxrwx 2 nobody 4096 Dec 31 1969 .link0 -> .root/w3.arizona.edu:180/enroll/
lrwxrwxrwx 2 nobody 4096 Dec 31 1969 .link1 -> .root/www.arizona.edu/estudents.html/
lrwxrwxrwx 2 nobody 4096 Dec 31 1969 .root -> ../
drwxrwxrwx 2 nobody 4096 Dec 31 1969 estudents.html/

(subsequent directory entries eliminated for brevity)

The directory for the www.arizona.edu page contains several files: the .contents file containing the page’s HTML; a .headers file that contains the headers returned when the page was originally fetched from the Web server; and several .link files, each of which represents a hyperlink in the page and is a symbolic link to the Gecko pathname of the hyperlink’s target. The symbolic links are relative to allow different clients to mount Gecko at different places in their file systems; the .root symbolic link reduces the complexity of the .link targets by factoring out the common prefix.

The user views the contents and headers of the Web page as follows:

>cat .headers
HTTP/1.0 200 Document Follows
Date: Fri, 09 Oct 1998 19:28:32 GMT
Server: Apache/1.2.4
Last-Modified: Thu, 08 Oct 1998 21:01:27 GMT
ETag: "2906-ded-361d2827"
Content-Length: 3565
Accept-Ranges: bytes
Connection: close
Content-Type: text/html
>cat .contents
<html><head> <TITLE>The University of Arizona</TITLE> </head>

(remainder eliminated for brevity)

Following one of the hyperlinks is as simple as using the cd command to change to the directory that represents the target of the link. When this is executed, the Linux pwd/getcwd algorithm automatically resolves the symbolic link and determines the correct directory.

>cd .link1;pwd
/web/www.arizona.edu/estudents.html

3.0 HTTP vs. NFS Transfer Protocols

Not only does Gecko provide access to the Web via the Unix file system interface, it also uses the NFS protocol instead of HTTP to transfer page contents from the Gecko server to the client. This improves performance and also ensures consistency between pages cached on the clients and on the server. All clients see the same version of the page that is cached on the server.

The protocol for transferring a file between a server and a client can be broken down into four distinct phases. Name resolution maps the textual representation of the file name into an internal representation that uniquely identifies the file and is used by subsequent phases. Attribute retrieval retrieves essential file information necessary to complete the transfer, such as permissions, content type and length, and modification time. Content transfer transfers the file data from the server to the client, and the cleanup phase closes the file and terminates the transfer. Although the HTTP and NFS transfer protocols are quite different, Gecko provides the entire HTTP functionality via the NFS protocol.

3.1 HTTP Transfer Protocol

HTTP performs the entire transfer via a single TCP connection. TCP [7] connections operate as bi-directional character streams or pipes in which the client may write an arbitrary number of bytes to the server and the server may respond by writing an arbitrary number of bytes to the client. There is no explicit request/reply order inherent in the TCP protocol, nor is there any explicit bundling of bytes into fixed-size records. The client opens the TCP connection to the server, sends the URL for the requested page and associated headers to the server, and receives the requested page over the same connection. The client can then close the connection or request another page on the same connection. Since all four phases of file transfer are accomplished in one connection, there is no need for an HTTP server to preserve state between connections.

Name resolution in HTTP is a two-step process. First, the client must extract the server’s name from the URL and use this name to create a connection to the server. The client then sends the URL and request headers to the server, and the server converts them into an internal representation; for most Web servers this is the file name of the page relative to the server’s local file system. As described previously, both the URL and request headers determine the internal representation, so that two requests with identical URLs but different headers may map to two entirely different files on the server.

Attribute retrieval is performed next. Typically a Web server acquires attributes such as the file size and modification time from the attributes of the file that stores the page. Other attributes, such as the content type of the file or cookies to be returned along with the page, are generated by the Web server itself. All headers are sent to the client as a series of plain text strings, and a blank line terminates the block of headers.

Content transfer is performed by sending the entire contents of the page over the TCP connection. HTTP does not perform any error checking or flow control itself, instead depending on the TCP protocol for these. Although newer specifications of HTTP do permit the client to request only a portion of the file contents, most HTTP implementations require the entire contents to be transmitted at once.

Transfer termination is done by closing the TCP connection. The TCP protocol is responsible for communicating the end of transmission and other cleanup details. Once the client detects a closed connection, it knows the page has been received.
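To make the four phases concrete, the following is a minimal sketch of an HTTP/1.0 client in C, showing how name resolution (the request line), attribute retrieval (the response headers), content transfer, and termination (connection close) all share a single TCP connection. The host, path, and headers shown are illustrative assumptions, not values from the paper’s experiments.

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int fetch(const char *host, const char *path)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
    if (getaddrinfo(host, "80", &hints, &res) != 0)
        return -1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        if (fd >= 0)
            close(fd);
        freeaddrinfo(res);
        return -1;
    }
    freeaddrinfo(res);

    /* Request line plus headers; a blank line ends the header block. */
    char req[512];
    snprintf(req, sizeof(req),
             "GET %s HTTP/1.0\r\nHost: %s\r\nAccept-Language: en\r\n\r\n",
             path, host);
    write(fd, req, strlen(req));

    /* Headers and contents arrive on the same connection; EOF marks the end. */
    char buf[8192];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}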

3.2 NFS Access Sequence

Rather than handling the entire transfer as one bulk connection, NFS breaks the transfer down into several functions, each of which may be executed independently. NFS uses UDP as the communication protocol between machines, instead of HTTP’s TCP. UDP is an unreliable datagram protocol that has less overhead than TCP, but also provides less functionality.

Name resolution is performed on both the client and the server. The client breaks the file name into its individual components, and contacts the server to resolve each independently. For each component, the client sends the server the handle for the parent directory and the component name. An NFS handle is a 64-byte quantity that uniquely identifies a file. The server replies to the request with a handle for the child. The handle for the root directory is given to the client when the file system is mounted. This style of name resolution allows the client to cache the results for use during subsequent name resolutions.
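The following sketch illustrates this component-by-component resolution. The nfs_handle type and nfs_lookup() call are hypothetical stand-ins for the real RPC machinery; this is not Gecko’s or any NFS client’s actual code.

#include <string.h>

typedef struct { unsigned char data[64]; } nfs_handle;   /* opaque 64-byte handle */

/* Hypothetical RPC: map (parent directory handle, component name) to a handle. */
extern int nfs_lookup(const nfs_handle *dir, const char *name, nfs_handle *out);

int resolve_path(const nfs_handle *root, const char *path, nfs_handle *out)
{
    char buf[1024];
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    nfs_handle cur = *root, next;      /* root handle obtained at mount time */
    for (char *comp = strtok(buf, "/"); comp != NULL; comp = strtok(NULL, "/")) {
        if (nfs_lookup(&cur, comp, &next) != 0)   /* one Lookup RPC per component */
            return -1;
        cur = next;
    }
    *out = cur;
    return 0;
}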

An NFS client receives file attributes as the result of certain function calls (such as Lookup), and it can request them explicitly via the GetAttr NFS function. This function is used primarily to verify that the cached copy of a file is up-to-date, by comparing its modification time with that of the server’s copy. NFS attributes are fixed-size and only include very common Unix attributes such as permissions, file size, modification time, etc. There is no provision for including arbitrary header information as with HTTP.

The NFS Read function is used to transfer the contents of a file from the server to the client. The client specifies the handle of the file from which to read, the offset within the file at which the read should start, and the number of bytes to read. The client can issue reads of any size, although most operating systems break reads into manageable units of 8KB or smaller. Reads may be issued out of order (i.e. random seek) and may skip unwanted sections of the file. Most client operating systems support some form of read ahead and/or parallel read capability to improve performance.

NFS transfers are not explicitly terminated as are HTTP transfers. A client can cache a file handle indefinitely and use it to read from the file. Most clients poll the server periodically using the GetAttr function to verify the consistency of cached files; the polling period is typically 3-60 seconds, depending on the rate at which the file is modified. This implies that while a client can hold a handle indefinitely, accesses to the file will be preceded by a Lookup or GetAttr request at most 60 seconds beforehand.

4.0 Prototype Implementation

To validate the Gecko design we implemented a prototype Gecko server. The server is implemented in the C programming language on the Linux operating system, version 2.0.32. The Linux thread library, Pthreads, was used for synchronization, and the SunRPC library was used for communication with the clients. The Gecko clients can run any operating system that implements NFS.

4.1 Handles

HTTP pages are uniquely identified by URLs and request headers, whereas NFS files are uniquely identified by NFS handles. URLs and request headers provide a very large namespace, whereas an NFS file handle is only a 64-byte string. In general, the NFS handle is too small to contain the URL and request headers. Instead, the server must store the URL and request headers for each handle and use information stored in the handle to find the correct URL and headers when the page is accessed. The Gecko server is therefore stateful, and stores a significant amount of information about each handle.

A further complication is that there are millions of pages on the Web, and it is impractical for the server to create a unique handle for each one of them and store the URL and header information. The server has room for relatively few handles, and will have to discard handles and reuse the space they occupy as new handles are created. This makes it impossible to guarantee that each Web page has a unique NFS handle, leading to the possibility of handle conflicts. In contrast, a standard NFS server constructs the handles from the file information stored on its disk, ensuring that each file has a unique handle.

Gecko uses the MD5 checksum algorithm [9] to generate NFS handles that are uniformly distributed within the space of valid handles. The MD5 algorithm is cryptographically secure, making it extremely unlikely that two different URLs map to the same handle. The page’s URL is used as the input to the MD5 checksum. The MD5 checksum, represented as 32 hexadecimal characters, is used as the first half of the 64-byte NFS handle. The remaining 32 bytes contain the header set information and miscellaneous file-specific flags.
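The following is a minimal sketch of this handle construction, assuming OpenSSL’s MD5() routine and a simple layout for the second half of the handle; the exact field layout is an assumption, not the prototype’s actual code.

#include <openssl/md5.h>
#include <stdint.h>
#include <string.h>

#define HANDLE_SIZE 64

struct gecko_handle {
    char md5_hex[32];      /* first half: MD5 of the URL as 32 hex characters */
    uint32_t header_set;   /* identifier of the header set used for the request */
    uint32_t flags;        /* miscellaneous file-specific flags */
    char pad[HANDLE_SIZE - 32 - 2 * sizeof(uint32_t)];
};

static void make_handle(const char *url, uint32_t header_set, struct gecko_handle *h)
{
    static const char hex[] = "0123456789abcdef";
    unsigned char digest[MD5_DIGEST_LENGTH];      /* 16 raw bytes */

    MD5((const unsigned char *)url, strlen(url), digest);
    for (int i = 0; i < MD5_DIGEST_LENGTH; i++) { /* expand to 32 hex characters */
        h->md5_hex[2 * i]     = hex[digest[i] >> 4];
        h->md5_hex[2 * i + 1] = hex[digest[i] & 0x0f];
    }
    h->header_set = header_set;
    h->flags = 0;
    memset(h->pad, 0, sizeof(h->pad));
}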

An administrator-configurable limit (65,535 by default) is imposed on the number of handles existing in the system at one time. If this limit is exceeded, then handles are discarded on an LRU basis. A handle is only discarded if it hasn’t been used in an access within the last 60 seconds. We assume that clients adhere to the default NFS behavior of checking the consistency of cached files and attributes at least every 60 seconds. If Gecko were to discard a handle within 60 seconds of its last use, it is possible that the client is still using the handle and will issue another request, such as a read, on the handle. In this case, Gecko would have discarded the URL and request headers for the handle, and the request would fail. If, however, Gecko only discards handles that haven’t been used for 60 seconds, the next request that uses the handle will probably be preceded by Lookup and GetAttr requests. These provide the Gecko server with enough information to regenerate the URL and request headers for the page. The only time that the next request for a discarded handle won’t be preceded by a Lookup and GetAttr is if an application has had the file open the entire time. In this case, the client will re-use the handle without validating its consistency. Thus Gecko may not correctly handle files that are open longer than 60 seconds when handle discard and re-use becomes necessary, but we assume that this is a rare occurrence.

If a request is made that necessitates creating a new handle, but all existing handles have been used within the last 60 seconds, the NFS request is dropped. The client then retries the request after the standard NFS timeout, and will continue to do so until the request succeeds. In the default server configuration of 65,535 handles, it is unlikely that they will all have been used in the last 60 seconds, so dropping a request should be rare.
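The replacement policy described in the two paragraphs above can be summarized by the following sketch. The data structures are hypothetical; the prototype’s actual bookkeeping is kept in its gdbm database (described next).

#include <stdbool.h>
#include <time.h>

#define MAX_HANDLES  65535     /* administrator-configurable limit */
#define IDLE_SECONDS 60        /* matches the NFS consistency-check period */

struct handle_entry {
    time_t last_used;          /* time of the most recent request on this handle */
    /* ... URL, request headers, children, etc. ... */
};

/* Decide whether a new handle may be created. Returns true if there is a free
 * slot or the LRU entry has been idle long enough to evict; returns false if
 * the NFS request should be dropped so the client retries later. */
bool reserve_handle_slot(const struct handle_entry *lru, int handle_count)
{
    if (handle_count < MAX_HANDLES)
        return true;
    if (lru != NULL && time(NULL) - lru->last_used >= IDLE_SECONDS)
        return true;           /* evict the LRU handle and reuse its slot */
    return false;              /* every handle used recently: drop the request */
}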

The server stores handle information in a gdbm [6] database. Gdbm allows the handle information to exceed the server’s physical memory and be stored to disk automatically. Gdbm is operated in fast mode with consistency relaxed in order to improve performance. In addition to the URL, each handle record also contains information on its children to assist in directory read operations.
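A minimal sketch of this storage layer is shown below, assuming the 64-byte handle itself is used as the gdbm key and that the URL, headers, and child list are packed into the record by the caller; the database file name and record layout are assumptions, not the prototype’s actual code.

#include <gdbm.h>
#include <stddef.h>

static GDBM_FILE handle_db;

/* Open the handle database in fast mode (relaxed consistency). */
int handle_db_open(void)
{
    handle_db = gdbm_open("handles.gdbm", 0, GDBM_WRCREAT | GDBM_FAST, 0644, NULL);
    return handle_db != NULL ? 0 : -1;
}

/* Key: the 64-byte NFS handle. Value: URL, request headers, and child list. */
int handle_db_store(void *handle, size_t handle_len, void *record, size_t record_len)
{
    datum key   = { handle, (int)handle_len };
    datum value = { record, (int)record_len };
    return gdbm_store(handle_db, key, value, GDBM_REPLACE);
}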

4.2 Caches

Caching is central to the Gecko project, since in a LAN configuration data may be shared by multiple clients. The Gecko server has two caches: an in-memory cache that holds files that are presently open and memory-mapped, and a larger disk cache. Gecko keeps recently accessed files open and mapped into the server’s memory, so that accesses to those files are fast. Read operations, which may occur in rapid succession on a given file, are served directly from the memory cache whenever possible and bypass the disk cache, file locking layer, and gdbm handle storage.

The disk cache stores Web pages in standard Unix files, named by the MD5 checksum of the pages’ URLs. The cache is configured as a tree structure with each node (directory) containing 256 entries to avoid the performance degradation caused by large Unix directories. The MD5 checksum is treated as a file name consisting of 32 component names that are used to traverse the disk cache directory tree. The size of and load on the directories in this tree should be fairly balanced because MD5 distributes checksums uniformly throughout the checksum space.
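The following sketch illustrates this directory-tree naming, under the assumption that each directory component is formed from a fixed number of checksum characters (one character per level is shown; grouping two characters per level would give 256 entries per directory). The cache root is an assumed configuration value, and this is not the prototype’s actual code.

#include <string.h>

#define CHARS_PER_LEVEL 1   /* assumption; 2 would give 256 entries per directory */

/* "cafe..." -> "<root>/c/a/f/e/..." */
int cache_path(const char *cache_root, const char *md5_hex /* 32 chars */,
               char *out, size_t outlen)
{
    size_t pos = strlen(cache_root);
    if (pos >= outlen)
        return -1;
    memcpy(out, cache_root, pos);

    for (size_t i = 0; i < 32; i += CHARS_PER_LEVEL) {
        if (pos + 1 + CHARS_PER_LEVEL >= outlen)
            return -1;
        out[pos++] = '/';                       /* one directory level per group */
        memcpy(&out[pos], &md5_hex[i], CHARS_PER_LEVEL);
        pos += CHARS_PER_LEVEL;
    }
    out[pos] = '\0';
    return 0;
}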

4.4 Client Proxy

Applications, such as browsers, already implement the HTTP interface. Gecko has been augmented to accept standard HTTP connections in addition to NFS connections, allowing a browser to connect directly to the Gecko server proxy. However, this approach does not provide a shared client cache as the NFS approach does. Each browser application on a client computer must implement its own private cache.

A better solution is to install a client proxy that is responsible for translating HTTP requests into NFS requests that are resolved by the operating system (Figure 2). An application, such as Netscape Navigator, is then directed to the client proxy. Navigator’s built-in disk cache may be disabled, and the operating system file cache will provide automatic cache support.

Figure 2: Client Proxy Configuration. The Client Proxy translates local HTTP requests from a browser application into NFS requests which are automatically resolved by the operating system.

Two details of the HTTP protocol, refresh and redirection, complicate the client proxy implementation. Both rely on special-purpose mechanisms built into the HTTP protocol that are not directly supported by NFS.

Refresh occurs when a client application wishes to force a fresh copy of a page to be fetched. Since NFS files are cached automatically by Linux, a cached copy is always provided if it is available. The workaround adopted by Gecko is for the client to first delete the file, and then re-open it, forcing a new copy to be fetched.
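A minimal sketch of this workaround, assuming the proxy already knows the Gecko pathname of the page’s .contents file (a hypothetical path variable, not the proxy’s actual code):

#include <fcntl.h>
#include <unistd.h>

/* Force a fresh copy: removing the file discards the cached page, and the
 * subsequent open causes the Gecko server to fetch it again from the Web. */
int force_refresh(const char *contents_path)
{
    unlink(contents_path);
    return open(contents_path, O_RDONLY);
}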

Redirection presents a special problem for URLs that contain a trailing slash. NFS does not support the trailing slash and automatically removes it, leading to an ambiguity in which it is impossible to distinguish a URL with a trailing slash from one without. The Gecko solution is to use an extra bit in the NFS attributes to stand in place of the trailing slash, and then to trick the client application into doing the right thing.

4.5 Status

The Gecko server and client proxy are fully implemented. There are no known problems, and we have successfully used Gecko to access the Web via a browser and through standard Unix commands. We have used the Unix cd and ls commands to traverse the Web, and the cp command to copy Web pages to the local file system. Executables can be invoked directly, and libraries can be used to link programs. The Unix command grep can be used to search Web pages, and we have built a simple spider and search engine using find and grep together. In general the experience has been positive, and the advantages of providing a file-system interface to the Web are readily apparent.

5.0 Experimental Results

Quantitative measurements of Gecko demonstrate its performance advantages over HTTP as a transport protocol. Tests were performed using two workstations connected via a 100 Mb/s Ethernet switch. One workstation was configured as the Gecko server and the other workstation as the client. Each workstation is a 200 MHz Pentium machine containing 128MB of memory and 2 to 4GB of disk storage. Workstations were configured with Linux kernel version 2.0.30 and pthreads version 0.5.

We measured the time taken to transfer a single page between the server and the client via HTTP and NFS. The page size was varied from 256B to 1MB. Each test was replicated 31 times with the first trial thrown out and the remaining 30 trials averaged to provide the result. We performed tests using the following protocols and configurations:

  1. Via HTTP to our own implementation of an HTTP proxy server.
  2. Via NFS to the Gecko server, including both the ‘file open’ overhead and the data transfer.
  3. Via NFS to the Gecko server, including only the data transfer and not the ‘file open’ overhead. This test was executed to isolate the fixed NFS lookup overhead from the time required to transfer data.
  4. Via Unix system calls to the local cache on the client. This test measures the speed at which a client may access locally cached data.
  5. Via HTTP to the Apache [8] web server (version 1.2.6). Apache is the most prevalent Web server today.

5.1 Performance Results

Figure 3: Page Transfer Time vs. Page Size. The graph compares the time in milliseconds to retrieve a page of the given size using the protocols and configurations listed above. Smaller numbers indicate better performance; NFS clearly outperforms HTTP for all transfers in the range of 2048 bytes to 1MB.

The Gecko server executing the NFS protocol outperforms the HTTP protocol on files in the range 2KB to 1MB (Figure 3). Web pages are typically in the 8KB-32KB range [1], in which NFS significantly outperforms HTTP. We believe this advantage is due in part to the congestion control mechanisms inherent in TCP, the underlying protocol of HTTP. NFS usually operates on top of the UDP protocol and implements its own specialized reliable delivery and flow control mechanisms, which favor a local area network.

At page sizes less than 2.0KB, Gecko was 0.6ms slower than HTTP due to the extra overhead of the NFS Lookup procedure call. The discrepancy is due in part to the unrealistically high performance of our HTTP server, which implements only a small subset of the HTTP protocol. When compared to the full-featured Apache Web server, Gecko outperforms Apache at all of the file sizes we tested.

We hypothesized that the NFS Lookup request would impose a fixed amount of overhead per file transfer. Comparing the results for the NFS benchmarks with and without the lookup time included, we see that all transfers smaller than 64KB have a lookup overhead of approximately 1ms. Transfer sizes above 128KB show less impact because the time required to transfer the data makes the lookup overhead relatively negligible. This is to be expected, since the NFS pathname resolution needs to be performed only once at the beginning of the transfer, whereas reading the data requires many repeated NFS function calls.

Reading data from the client-side cache produced low transfer times, as expected. In these cases, no communication was required between client and server, eliminating any network transfer component from the benchmark results. These results validate our hypothesis that the built-in Linux client-side caching eliminates the need for applications such as Netscape Navigator to cache Web pages themselves. For example, a cached read of an 8KB page requires only 0.3ms.

6.0 Related Work

The WebNFS [3] project, a proposal by Sun Microsystems, seeks to replace the HTTP protocol with a more Internet-compatible version of the NFS protocol. WebNFS offers many of the same performance improvements as Gecko, such as the ability to read a page without requiring a dedicated TCP connection. However, the specification departs significantly from traditional NFS semantics, suffers compatibility problems with existing applications, and is designed mainly for integration with specific Web-enabled applications. It is also unclear how WebNFS servers will cope with HTTP headers that modify content, such as language preferences or browser types.

WebFS [11], a project at UC Berkeley, is designed with goals similar to those of Gecko. WebFS is implemented as a loadable kernel module for the Solaris operating system and provides a mapping of the URL namespace into a file system namespace. While WebFS seeks to maintain compatibility with HTTP, its true goal is to replace HTTP with a more flexible protocol that provides support for cache coherency and authentication. WebFS requires end-users to modify the operating system by installing a kernel module, whereas the Gecko prototype is accessible to unmodified clients through the standard NFS protocol. WebFS does not provide server-side parsing of document links as Gecko does, and requires a special client-side utility program to parse links and expand the namespace. WebFS also does not provide a mechanism for communicating HTTP request headers.

Operating system vendors, such as Microsoft, continue to integrate the Internet into their products. However, these integrations are usually performed at the user interface level and not the file system level. The result is that although users perceive an integration of the Web with their computers, Web content is not available to most common applications and utilities. Such solutions continue to use HTTP and suffer the impact of its performance implications.

The Alex [4] system operates as an NFS front end to the FTP protocol in much the same manner that Gecko operates as a front end to the Web. Alex provides transparent access of FTP sites to application programs and provides cache consistency.

7.0 Conclusion

Gecko provides access to the Web via NFS, allowing Web pages to be named, accessed, and cached as Unix files are. Standard Unix applications such as cat and grep can be used to manipulate pages, eliminating the need for new "Web aware" applications. NFS provides cache consistency between clients and the Gecko server, ensuring that all applications using the Gecko system see the same version of a page. NFS also improves the performance of accessing pages on the Gecko server. Pages are automatically cached on the client, and name lookup results are cached for subsequent lookups, significantly improving their performance. The use of UDP instead of TCP to transfer pages enables Gecko to transfer a 16KB page between the server and client 6.5 times faster than HTTP.

8.0 References

[1] Martin F. Arlitt and Carey L. Williamson. Internet web servers: Workload characterization and performance implications. IEEE/ACM Transactions on Networking, 5(5):631-645, October 1997.

[2] T. Berners-Lee and D. Connolly, RFC 1866: HTML 2.0 Specification. http://www.w3.org/MarkUp/html-spec/.

[3] B. Callaghan, WebNFS Server Specification. http://194.52.182.97/rfc/rfc2055.html.

[4] Vincent Cate. Alex - a Global File System. Proceedings of the 1992 USENIX File Systems Workshop, pages 1-12. May 1992.

[5] R. Fielding, J. Gettys, J.C. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext transfer protocol - HTTP/1.1. http://www.ietf.cnri.reston.va.us/internet-drafts/draft-ietf-http-v11-spec-rev-03.txt

[6] Free Software Foundation, Gdbm: The GNU database library. http://www.cl.cam.ac.uk/texinfodoc/gdbm_toc.html

[7] Information Sciences Institute, RFC 793: Transmission Control Protocol. http://www.alternic.net/rfcs/700/rfc793.txt.html

[8] Ben Laurie and Peter Laurie, Apache: The Definitive Guide, O’Reilly & Associates, Sebastopol, CA, 1997.

[9] R. Rivest, RFC 1321: The MD5 Message-Digest Algorithm. http://www.alternic.net/rfcs/1300/rfc1321.txt.html

[10] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and implementation of the Sun network filesystem. In Proceedings of the Summer 1985 USENIX Conference, pages 119-130, June 1985.

 
[11] Amin M. Vahdat, Paul C. Eastham, and Thomas E. Anderson, WebFS: A Global Cache Coherent File System. http://www.cs.berkeley.edu/~vahdat/webfs/webfs.html.
 
 
 
Scott M. Baker is an independent software consultant and received his Bachelor and Master of Science degrees from the University of Arizona. His interests include networking, graphics, and distributed information systems.
 

John H. Hartman is an assistant professor in the Department of Computer Science at The University of Arizona. His research interests include distributed systems, operating systems, and file systems. He received a Ph.D. in Computer Science from the University of California at Berkeley in 1994 and an M.S. in 1990, and an Sc.B. in Computer Science from Brown University in 1987.