Finding Related Pages in the World Wide Web
Jeffrey Dean*
mySimon, Inc.
Santa Clara, CA
jdean@mysimon.com

Monika R. Henzinger
Compaq Systems Research Center
130 Lytton Ave.
Palo Alto, CA 94301
monika@pa.dec.com
Abstract
When using traditional search engines, users have to formulate queries to describe their
information need. This paper discusses a different approach to web searching where the input
to the search process is not a set of query terms, but instead is the URL of a page, and the output
is a set of related web pages. A related web page is one that addresses the same topic as the
original page. For example, www.washingtonpost.com is a page related to www.nytimes.com,
since both are online newspapers.
We describe two algorithms to identify related web pages. These algorithms use only the
connectivity information in the web (i.e., the links between pages) and not the content of pages
or usage information. We have implemented both algorithms and measured their runtime
performance. To evaluate the effectiveness of our algorithms, we performed a user study comparing
our algorithms with Netscape's "What's Related" service [12]. Our study showed that the
precision at 10 of our two algorithms is 73% better and 51% better than that of Netscape,
despite the fact that Netscape uses both content and usage pattern information in addition to
connectivity information.
Keywords: search engines, related pages, searching paradigms.
1 Introduction
Traditional web search engines take a query as input and produce a set of (hopefully) relevant
pages that match the query terms. While useful in many circumstances, search engines have the
disadvantage that users have to formulate queries that specify their information need, which is
prone to errors. This paper discusses how to find related web pages, a different approach to web
searching. In our approach the input to the search process is not a set of query terms, but the URL
of a page, and the output is a set of related web pages. A related web page is one that addresses the
same topic as the original page, but is not necessarily semantically identical. For example, given
www.nytimes.com, the tool should find other newspapers and news organizations on the web. Of
course, in contrast to search engines, our approach requires that the user has already found a page
of interest.
Recent work in information retrieval on the web has recognized that the hyperlink structure
can be very valuable for locating information [18, 3, 7, 23, 19, 25, 24, 6, 17, 5]. This assumes that
if there is a link from page v to page w, then the author of v recommends page w, and that links often
connect related pages. In this paper, we describe the Companion and Cocitation algorithms, two
algorithms which use only the hyperlink structure of the web to identify related web pages. For
example, Table 1 shows the output of the Companion algorithm when given www.nytimes.com as
*This work was done while the author was at the Compaq Western Research Laboratory.