<- Previous | First | Next ->

2.2 Cocitation Algorithm

An alternative approach for nding related pages is to examine the siblings of a starting node u

in the web graph. Two nodes are if they have a common parent. The number of common

parents of two nodes is their . As an alternative to the Companion algorithm, we

have developed a very simple algorithm that determines related pages by looking for sibling nodes

with the highest degree of co-citation. The Cocitation algorithm rst chooses up to B arbitrary

parents of u. For each of these parents p, it adds to a set S up to BF children of p that surround

the link from p to u. The elements of S are siblings of u. For each node s in S, we determine

the degree of cocitation of s with u. Finally, the algorithm returns the 10 most frequently cocited

nodes in S as the related pages.

In some cases there is an insucient level of cocitation with u to provide meaningful results. In

our implementation, if there are not at least 15 nodes in S that are cocited with u at least twice,

then we restart the algorithm using the node corresponding to u's URL with one path element

removed. For example, if u's URL is a.com/X/Y/Z and an insucient number of cocited nodes

exist for this URL, then we restart the algorithm with the URL a.com/X/Y (if the resulting URL

is invalid, we continue to chop elements until we are left with just a host name, or we nd a valid

URL).

In our implementation, we chose B to be 2000 and BF to be 8 (the same parameter values we

used for our implementation of the Companion algorithm).

One way of looking at the Cocitation algorithm is that it nds \maximal" n X 2 bipartite

subgraphs in the vicinity graph.
2.3 Netscape's Approach

Netscape introduced a new \What's Related?" feature in version 4.06 of the Netscape Communicator browser. Details about the approach used to identify related pages in their algorithm are

sketchy. However, the What's Related FAQ page indicates that the algorithm uses connectivity

information, usage information, and content analysis of the pages to determine relationships. We

quote from the \What's Related" FAQ page:

The What's Related data is created by Alexa Internet. Alexa uses crawling, archiving, categorizing and data mining techniques to build the related sites lists for millions of web URLs.

For example, Alexa uses links on the crawled pages to nd related sites. The day-to-day use of

What's Related also helps build and re ne the data. As the service is used, the requested URLs

are logged. By looking at high-level trends, Netscape and Alexa can deduce relationships between

web sites. For example, if thousands of users go directly from site A to site B, the two sites are

likely to be related.

Next, Alexa checks all the URLs to make sure they are live links. This process removes links that

would try to return pages that don't exist (404 errors), as well as any links to servers that aren't

available to the general Internet population, such as servers that are no longer active or are

behind rewalls. Finally, once all of the relationships are established and the links are checked,

the top ten related sites for each URL are chosen by looking at the strength of the relationship

between the sites.

Each month, Alexa recrawls the web and rebuilds the data to pull in new sites and to re ne

the relationships between the existing sites. New sites with strong relationships to a site will

automatically appear in the What's Related list for that site by displacing any sites with weaker

relationships.

Note that since the relationships between sites are based on strength, What's Related lists are not

necessarily balanced. Site A may appear in the list for Site B, but Site B may not be in the list

6


<- Previous | First | Next ->