2. Contract duplicates and near-duplicates in this graph.
3. Compute edge weights based on host to host connections.
4. Compute a score and an score for each node in the graph and return the top
ranked authority nodes (our implementation returns the top 10). This phase of the algorithm
uses a modied version of the HITS algorithm originally proposed by Kleinberg [17].
These steps are described in more detail in the subsections below. Only step 1 exploits the order
of links on a page.
2.1.1 Step 1: Building the Vicinity Graph
Given a query URL u we build a directed graph of nodes that are nearby to u in the web graph.
Graph nodes correspond to web pages and graph edges correspond to hyperlinks. The graph consists
of the following nodes (and the edges between these nodes):
1. u,
Here is how we choose these nodes in detail: There is a stoplist STOP of URLs that are unrelated
to most queries and that have very high indegree. Our stoplist was developed by experimentation,
and currently contains 21 URLs. Examples of nodes that appear on our stoplist are www.yahoo.com
and www.microsoft.com/ie/download.html. If the starting URL u is not one of the nodes on our
stoplist, then we ignore all the URLs on the stoplist when forming the vicinity graph. If u does
appear on the stoplist, however, then we disable the stoplist (i.e. set STOP to the empty list) and
freely include any nodes in the vicinity graph. We disable the stoplist when the input URL appears
on the stoplist because many nodes on the stoplist are popular search engines and portals, and we
want to permit these nodes to be considered when the input URL is another such popular site.
: If u has more than B parents, add B random parents not
on STOP to the graph; otherwise add all of u's parents. If a parent x of u has more than BF + 1
children, add up to BF=2 children pointed to by the BF=2 links on x immediately preceding the
link to u and up to BF=2 children pointed to by the BF=2 links on x immediately succeeding the
link to u (ignoring duplicate links). If page x has fewer than BF children, we add all of its children
to the graph. Note that this exploits the order of links on page x.
: If u has more than F children, add the children pointed
to by the rst F links of u; otherwise, add all of u's children. If a child of u has more than BF
parents, add the BF parents not on STOP with highest indegree; otherwise, add all of the child's
parents.
If there is a hyperlink from a page represented by node v in the graph to a page represented by
node w and v and w do not belong to the same host, then there is a directed edge from v to w in
the graph (we omit edges between nodes on the same host).
In our experience, we have found that using a large value for B (2000) and a small value for BF
(8) works better in practice than using moderate values for each (say, 50 and 50). We have observed
that links to pages on a similar topic tend to be clustered together, while links that are farther
apart on a page are less likely to be on the same topic (for example, most hotlists are grouped into
categories). This has also been observed by other researchers [9]. Using a larger value for B also
means that the likelihood of the computation being dominated by a single parent page is reduced.
4