2.1.2 Step 2: Duplicate Elimination
After building this graph we combine . We say two nodes are near-duplicates if (a)
they each have more than 10 links and (b) they have at least 95% of their links in common. To
combine two near-duplicates we replace their two nodes by a node whose links are the union of
the links of the two near-duplicates. This duplicate elimination phase is important because many
pages are duplicated across hosts (e.g. mirror sites, dierent aliases for the same page), and we
have observed that allowing them to remain separate can greatly distort the results.
In this stage, we assign a weight to each edge, using the edge weighting scheme of Bharat and
Henzinger  which we repeat here for completeness. An edge between two nodes on the same
has weight 0. If there are k edges from documents on a rst host to a single document on a
second host we give each edge an of 1=k. This weight is used when computing the
authority score of the document on the second host. If there are l edges from a single document
on a rst host to a set of documents on a second host, we give each edge a of 1=l. We
perform this scaling to prevent a single host from having too much in uence on the computation.
We call the resulting weighted graph the of u.
2.1.4 Step 4: Compute Hub and Authority Scores
In this step, we run the algorithm  on the graph to compute and scores. The
algorithm is a straightforward extension of the HITS algorithm with edge weights.
The intuition behind the HITS algorithm is that a document that points to many others is
a good hub, and a document that many documents point to is a good authority. Transitively, a
document that points to many good authorities is an even better hub, and similarly a document
pointed to by many good hubs is an even better authority. The HITS computation repeatedly
updates hub and authority scores so that documents with high authority scores are expected to
have relevant , whereas documents with high hub scores are expected to contain to
relevant content. The computation of hub and authority scores is done as follows:
Initialize all elements of the hub vector H to 1.0.
Initialize all elements of the authority vector A to 1.0.
While the vectors H and A have not converged:
For all nodes n in the vicinity graph N,
A[n] := P(n 0
] authority weight(n 0 ; n)
For all n in N,
H[n] := P(n;n 0
Normalize the H and A vectors.
Note that the algorithm does not claim to nd relevant pages, since there may be some that
have good content but have not been linked to by many authors.
The Companion algorithm then returns the nodes with the ten highest authority scores (excluding u itself) as the pages that are most related to the start page u.
1 We assume throughout the paper that the can be determined from the URL-string.