for Site A. Generally, this happens when the number of sites with strong relationships is greater
than ten, or when sites do not have similar enough content.
In experimenting with these algorithms, we were fortunate to have access to Compaq's Connectivity
Server . The Connectivity Server provides high speed access to the graph structure of 180 million
URLs (nodes) and the links (edges) that connect them. The entire graph structure is kept in memory
on a Compaq AlphaServer 4100 with 8 GB of main memory and dual 466 MHz Alpha processors.
The random access patterns engendered by the connectivity-based algorithms described in this
paper mean that it is important for most or all of the graph to t in main memory to prevent high
levels of paging activity.
We implemented a multi-threaded server that accepts a URL uses either the Cocitation algorithm or the Companion algorithm to nd pages related to the given URL. Our server implementation consists of approximately 5500 lines of commented C code, of which approximately 1000
lines implement the Companion algorithm, 400 lines implement the Cocitation algorithm, and the
remainder are shared code to perform tasks such as parsing HTTP query requests, printing results,
and logging status messages. We link our server code directly with the Connectivity Server library,
and access the connectivity information by mmapping the graph information into the address space
of our server.
Our implementation of the Companion algorithm has been subjected to a moderate amount
of performance tuning, mostly in designing the neighborhood graph data structures to improve
data-cache performance. The implementation of the Cocitation algorithm has not been tuned
extensively, although it does share a fair amount of code with the Companion algorithm, and this
shared code has been tuned somewhat.
In this section we describe the evaluation we performed for the algorithms. Section 4.1 describes
our user study, while Section 4.2 discusses the results of the study. Section 4.3 evaluates the run
time performance of our algorithms.
4.1 Experimental Setup
To compare the dierent approaches, we performed a user study. We asked 18 volunteers to supply
us with at least two URLs for which they wanted to nd related pages. Our volunteers included 14
computer science professionals (mostly our colleagues at Compaq's research laboratories), as well
as 4 people with other professional careers. We received 69 URLs and used each of the algorithms
to determine the top 10 answers for each URL. We put the answers in random order and returned
them to the volunteers for rating. The volunteers were instructed as follows:
We want to measure how well each algorithm performs. To measure performance we want to
know the percentage of valuable URLs returned by each algorithm. To be valuable the URL must
be both relevant to the topic you are interested in and a high quality page. For example, if your
URL was www.audi.com and you get back a newsgroup posting where somebody talks about his
new Audi car, then the page was on topic, but probably not high quality. On the other hand, if
you get www.jaguar.com as an answer, then it is up to you to decide whether this answer is on
topic or not.