2. Contract duplicates and near-duplicates in this graph.

3. Compute edge weights based on host-to-host connections.

4. Compute a hub score and an authority score for each node in the graph and return the top-ranked authority nodes (our implementation returns the top 10). This phase of the algorithm uses a modified version of the HITS algorithm originally proposed by Kleinberg [17].

These steps are described in more detail in the subsections below. Only step 1 exploits the order

of links on a page.
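As background for step 4, the basic (unmodified) HITS iteration of Kleinberg can be sketched as follows. The adjacency representation, iteration count, and normalization here are illustrative assumptions; the paper's modifications to HITS are described in the subsections below.

```python
def hits(out_edges, iterations=50):
    """Basic HITS sketch. out_edges: dict mapping node -> list of nodes it links to."""
    nodes = set(out_edges)
    for targets in out_edges.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for v, targets in out_edges.items():
            for w in targets:
                auth[w] += hub[v]
        # Hub score: sum of authority scores of the nodes it points to.
        hub = {v: sum(auth[w] for w in out_edges.get(v, ())) for v in nodes}
        # Normalize so the scores stay bounded across iterations.
        for scores in (hub, auth):
            norm = sum(x * x for x in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

def top_authorities(out_edges, k=10):
    """Return the k top-ranked authority nodes."""
    _, auth = hits(out_edges)
    return sorted(auth, key=auth.get, reverse=True)[:k]
```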

2.1.1 Step 1: Building the Vicinity Graph

Given a query URL u, we build a directed graph of nodes that are near u in the web graph.

Graph nodes correspond to web pages and graph edges correspond to hyperlinks. The graph consists

of the following nodes (and the edges between these nodes):

1. u,

2. up to B parents of u, and for each parent up to BF of its children different from u, and

3. up to F children of u, and for each child up to FB of its parents different from u.

Here is how we choose these nodes in detail: There is a stoplist STOP of URLs that are unrelated

to most queries and that have very high indegree. Our stoplist was developed by experimentation,

and currently contains 21 URLs. Examples of nodes that appear on our stoplist are www.yahoo.com

and www.microsoft.com/ie/download.html. If the starting URL u is not one of the nodes on our

stoplist, then we ignore all the URLs on the stoplist when forming the vicinity graph. If u does

appear on the stoplist, however, then we disable the stoplist (i.e. set STOP to the empty list) and

freely include any nodes in the vicinity graph. We disable the stoplist when the input URL appears

on the stoplist because many nodes on the stoplist are popular search engines and portals, and we

want to permit these nodes to be considered when the input URL is another such popular site.
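This stoplist rule can be sketched as follows; the two URLs shown are the examples given above, standing in for the full list of 21 entries.

```python
# Illustrative subset of the stoplist of high-indegree, query-unrelated
# URLs; the actual list contains 21 entries.
STOP = {
    "www.yahoo.com",
    "www.microsoft.com/ie/download.html",
}

def effective_stoplist(u):
    """Disable the stoplist entirely when the query URL u is itself on it,
    so popular portals can still be related to one another."""
    return set() if u in STOP else STOP
```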

Choosing u's parents and their children: If u has more than B parents, add B random parents not on STOP to the graph; otherwise add all of u's parents. If a parent x of u has more than BF + 1 children, add up to BF/2 children pointed to by the BF/2 links on x immediately preceding the link to u and up to BF/2 children pointed to by the BF/2 links on x immediately succeeding the link to u (ignoring duplicate links). If page x has fewer than BF children, we add all of its children to the graph. Note that this exploits the order of links on page x.
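This parent-side selection can be sketched as follows, assuming each page's out-links are available as an ordered list; the helper names are ours.

```python
import random

def select_parents(parents_of_u, stop, B):
    """Up to B random parents of u, skipping stoplisted URLs."""
    candidates = [p for p in parents_of_u if p not in stop]
    return random.sample(candidates, B) if len(candidates) > B else candidates

def children_near_u(links_on_x, u, BF):
    """links_on_x: ordered list of link targets on a parent page x,
    which contains a link to u.  Returns up to BF children of x other
    than u, taken from the links immediately surrounding the link to u;
    this is where the order of links on the page is exploited."""
    distinct = list(dict.fromkeys(links_on_x))
    if len(distinct) <= BF:          # x has at most BF children: take all
        return [c for c in distinct if c != u]
    i = links_on_x.index(u)          # position of the link to u on page x
    half = BF // 2
    window = links_on_x[max(0, i - half):i] + links_on_x[i + 1:i + 1 + half]
    return [c for c in dict.fromkeys(window) if c != u]   # ignore duplicates
```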

Choosing u's children and their parents: If u has more than F children, add the children pointed to by the first F links of u; otherwise, add all of u's children. If a child of u has more than FB parents, add the FB parents not on STOP with highest indegree; otherwise, add all of the child's parents.
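The forward-side selection can be sketched similarly; again the helper names, and the choice to pass indegrees as a precomputed map, are our assumptions.

```python
def children_of_u(links_on_u, F):
    """Children pointed to by the first F links of u
    (duplicates dropped, page order preserved)."""
    return list(dict.fromkeys(links_on_u))[:F]

def parents_of_child(parents, indegree, stop, FB):
    """Up to FB non-stoplisted parents of a child of u, preferring
    those with the highest indegree (indegree: URL -> in-link count)."""
    eligible = [p for p in parents if p not in stop]
    eligible.sort(key=lambda p: indegree.get(p, 0), reverse=True)
    return eligible[:FB]
```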

If there is a hyperlink from a page represented by node v in the graph to a page represented by

node w and v and w do not belong to the same host, then there is a directed edge from v to w in

the graph (we omit edges between nodes on the same host).
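The edge rule can be sketched as follows, assuming scheme-less URLs of the form used above (e.g. www.yahoo.com); the `host` helper is ours.

```python
from urllib.parse import urlsplit

def host(url):
    """Hostname of a scheme-less URL such as 'www.yahoo.com/index.html'."""
    return urlsplit("//" + url).hostname

def vicinity_edges(nodes, hyperlinks):
    """hyperlinks: iterable of (v, w) page-level links.  Keep a directed
    edge v -> w only when both pages are vicinity-graph nodes and they
    live on different hosts (intra-host links are omitted)."""
    node_set = set(nodes)
    return [(v, w) for v, w in hyperlinks
            if v in node_set and w in node_set and host(v) != host(w)]
```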

In our experience, we have found that using a large value for B (2000) and a small value for BF

(8) works better in practice than using moderate values for each (say, 50 and 50). We have observed

that links to pages on a similar topic tend to be clustered together, while links that are farther

apart on a page are less likely to be on the same topic (for example, most hotlists are grouped into

categories). This has also been observed by other researchers [9]. Using a larger value for B also reduces the likelihood that the computation is dominated by a single parent page.
