2.1.2 Step 2: Duplicate Elimination

After building this graph we combine near-duplicate nodes. We say two nodes are near-duplicates if (a)

they each have more than 10 links and (b) they have at least 95% of their links in common. To

combine two near-duplicates we replace their two nodes by a node whose links are the union of

the links of the two near-duplicates. This duplicate elimination phase is important because many

pages are duplicated across hosts (e.g. mirror sites, different aliases for the same page), and we

have observed that allowing them to remain separate can greatly distort the results.
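The near-duplicate test and merge described above can be sketched as follows; the representation of a node as a set of its links is an illustrative assumption, not the paper's data structure.

```python
# Hedged sketch of the near-duplicate test: each node is represented here
# simply as the set of links it contains (an assumption for illustration).

def near_duplicates(links_a: set, links_b: set) -> bool:
    """Two nodes are near-duplicates if (a) each has more than 10 links
    and (b) at least 95% of their links are in common."""
    if len(links_a) <= 10 or len(links_b) <= 10:
        return False
    common = len(links_a & links_b)
    return common >= 0.95 * len(links_a) and common >= 0.95 * len(links_b)

def merge(links_a: set, links_b: set) -> set:
    """Replace two near-duplicate nodes by one node whose links
    are the union of the links of the two near-duplicates."""
    return links_a | links_b
```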

2.1.3 Step 3: Assign Edge Weights

In this stage, we assign a weight to each edge, using the edge weighting scheme of Bharat and

Henzinger [5], which we repeat here for completeness. An edge between two nodes on the same host¹ has weight 0. If there are k edges from documents on a first host to a single document on a second host, we give each edge an authority weight of 1/k. This weight is used when computing the authority score of the document on the second host. If there are l edges from a single document on a first host to a set of documents on a second host, we give each edge a hub weight of 1/l. We perform this scaling to prevent a single host from having too much influence on the computation. We call the resulting weighted graph the vicinity graph of u.
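The edge-weighting scheme can be sketched as below. The `host` function, which maps a document to its host, is an assumed helper (the paper assumes the host can be determined from the URL string), and representing edges as (source, target) pairs is an illustrative choice.

```python
from collections import defaultdict

# Hedged sketch of the Bharat-Henzinger edge-weighting scheme.
# `edges` is a list of (source_doc, target_doc) pairs; `host` is an
# assumed helper mapping a document to its host.

def edge_weights(edges, host):
    auth_w, hub_w = {}, {}
    # k: number of edges from documents on one host to a single target doc
    k = defaultdict(int)
    # l: number of edges from a single source doc to documents on one host
    l = defaultdict(int)
    for u, v in edges:
        if host(u) == host(v):
            continue  # same-host edges get weight 0
        k[(host(u), v)] += 1
        l[(u, host(v))] += 1
    for u, v in edges:
        if host(u) == host(v):
            auth_w[(u, v)] = hub_w[(u, v)] = 0.0
        else:
            auth_w[(u, v)] = 1.0 / k[(host(u), v)]  # authority weight 1/k
            hub_w[(u, v)] = 1.0 / l[(u, host(v))]   # hub weight 1/l
    return auth_w, hub_w
```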

2.1.4 Step 4: Compute Hub and Authority Scores

In this step, we run the algorithm of Bharat and Henzinger [5] on the vicinity graph to compute hub and authority scores. The algorithm is a straightforward extension of the HITS algorithm with edge weights.

The intuition behind the HITS algorithm is that a document that points to many others is

a good hub, and a document that many documents point to is a good authority. Transitively, a

document that points to many good authorities is an even better hub, and similarly a document

pointed to by many good hubs is an even better authority. The HITS computation repeatedly

updates hub and authority scores so that documents with high authority scores are expected to

have relevant content, whereas documents with high hub scores are expected to contain links to

relevant content. The computation of hub and authority scores is done as follows:

Initialize all elements of the hub vector H to 1.0.
Initialize all elements of the authority vector A to 1.0.
While the vectors H and A have not converged:
    For all nodes n in the vicinity graph N,
        A[n] := Σ_{(n',n) ∈ edges(N)} H[n'] × authority_weight(n', n)
    For all n in N,
        H[n] := Σ_{(n,n') ∈ edges(N)} A[n'] × hub_weight(n, n')
    Normalize the H and A vectors.
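The iteration above can be sketched in Python as follows. The convergence test, tolerance, iteration cap, and Euclidean normalization are illustrative assumptions; the per-edge weight tables are those produced by the edge-weighting step.

```python
import math

# Hedged sketch of the weighted hub/authority iteration. `edges` is a list
# of (n', n) pairs; auth_w and hub_w map each edge to its weight. The
# tolerance, iteration cap, and choice of norm are illustrative assumptions.

def hub_authority(nodes, edges, auth_w, hub_w, tol=1e-8, max_iter=100):
    H = {n: 1.0 for n in nodes}
    A = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # A[n] := sum over edges (n', n) of H[n'] * authority_weight(n', n)
        new_A = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_A[v] += H[u] * auth_w[(u, v)]
        # H[n] := sum over edges (n, n') of A[n'] * hub_weight(n, n')
        new_H = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_H[u] += new_A[v] * hub_w[(u, v)]
        # Normalize the H and A vectors (Euclidean norm here).
        for vec in (new_A, new_H):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for n in vec:
                vec[n] /= norm
        converged = all(abs(new_H[n] - H[n]) < tol for n in nodes) and \
                    all(abs(new_A[n] - A[n]) < tol for n in nodes)
        H, A = new_H, new_A
        if converged:
            break
    return H, A
```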

Note that the algorithm does not claim to find relevant pages, since there may be some that

have good content but have not been linked to by many authors.

The Companion algorithm then returns the nodes with the ten highest authority scores (excluding u itself) as the pages that are most related to the start page u.
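The final selection step is simple enough to sketch directly: given the authority vector, return the ten highest-scoring nodes other than u. The function name and signature are illustrative.

```python
# Hedged sketch of the final step: return the nodes with the ten highest
# authority scores, excluding the start page u itself. `A` is the authority
# vector produced by the hub/authority computation.

def related_pages(A, u, k=10):
    candidates = [n for n in A if n != u]
    return sorted(candidates, key=lambda n: A[n], reverse=True)[:k]
```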

¹ We assume throughout the paper that the host can be determined from the URL-string.
