<- Previous | First | Next ->

de ned by the combined link structure of a large number of starting nodes on the topic. Kleinberg

also proposed that the HITS algorithm could be used to nd related pages when the topic was

de ned by just a single node. The Companion algorithm used HITS algorithm as a starting point

and extended and modi ed it in four main ways:
1. Kleinberg suggested using the following graph to nd related pages: Take a xed number (say

200) of parents of the given URL and call the set consisting of the URL and these parents the

. Now build the graph consisting of all nodes pointing to a node in the start set or

pointed to by a node in the start set. This means that \grandparents" of u are included in

the graph, while nodes that share a child with u are not included in the graph. We believe

that the latter nodes are more likely to be related to u than are the \grandparent" nodes.

Therefore our vicinity graph is structured to exclude grandparent nodes but to include nodes

that share a child with u.
2. We exploit the order of the links on a page to determine which \siblings" of u to include.

When we added this feature, the precision of our algorithm improved noticably.
3. The original HITS algorithm did not have edge weights. We use edge weights to reduce the

in uence of pages that all reside on one host, since Bharat and Henzinger have shown that

edge weights improve the precision [5].
4. We also merge nodes in our vicinity graph that have a large number of duplicate links. Duplicate nodes are not such a serious problem when using the HITS or algorithms to rank

query results, since the start set consists of a large number of URLs. However, when forming

the vicinity graph starting with just a single URL, the in uence of duplicate nodes is increased

because duplicate nodes with a large number of out links will quickly dominate the hub and

authority computation.

Kleinberg also showed that HITS computes the principal eigenvector of the matrix AA

, where

A is the adjacency matrix of the above graph, and suggested using non-principal eigenvectors for

nding related pages. Finally, he gave anecdotal evidence that HITS might work well.

Consecutively, a sequence of papers [10, 8] presented improvements on HITS and used it to

populate a given hierarchy of categories. These improvements are not directly relevant to the task

of nding related pages.

6 Conclusion

We have presented two di erent algorithms for nding related pages in the WWW. They significantly outperform Netscape's algorithm for nding related pages. The algorithms can be implemented eciently and are suitable for use within a large scale web service providing a related pages

feature.

Our two algorithms can be extended to handle more than one input URL. In this case, the

algorithms would compute pages that are related to all input URLs. We are currently exploring

these extensions.

Acknowledgements

This work has bene ted greatly from discussions we have had with Krishna Bharat, Andrei Broder,

Puneet Kumar, and Hannes Marais. We are also indebted to Puneet for his work on the Connectivity Server. As some of the earliest users of the server, Puneet answered our many questions and

implemented many improvements to the server in response to our suggestions. We are also grateful

13


<- Previous | First | Next ->