The PageGather Algorithm

In this section, we introduce *PageGather*. Given a large
access log, our task is to find collections
of pages that tend to co-occur in visits. Clustering (see
[22,16,24]) is a
natural technique to consider for this task. In clustering, documents are
represented in an N-dimensional space (for example, as word vectors).
Roughly, a cluster is a collection of documents close to each other and
relatively distant from other clusters. Standard clustering algorithms
*partition* the documents into a set of mutually exclusive clusters.

Cluster *mining* is a variation on traditional clustering that is
well suited to our task. Instead of attempting to partition the
entire space of documents, we try to find a small number of high
quality clusters. Furthermore, whereas traditional clustering is
concerned with placing each document in exactly one cluster, cluster mining
may place a single document in multiple overlapping clusters. The
relationship between traditional clustering and cluster mining is
parallel to that between classification and data mining as described
in [20]. Segal contrasts mining ``nuggets'' -- finding
high-accuracy rules that capture patterns in the data -- with
traditional classification -- classifying *all* examples as
positive or negative -- and shows that traditional classification
algorithms do not make the best mining algorithms.

The *PageGather algorithm* uses cluster mining to find collections
of related pages at a web site, relying on the visit-coherence
assumption. In essence, PageGather takes a web server access log as
input and maps it into a form ready for clustering; it then applies
cluster mining to the data and produces candidate index-page contents
as output. The algorithm has five basic steps:

- 1.
- Process the access log into visits.
- 2.
- Compute the co-occurrence frequencies between pages and create a similarity matrix.
- 3.
- Create the graph corresponding to the matrix, and find maximal cliques (or connected components) in the graph.
- 4.
- Rank the clusters found, and choose which to output.
- 5.
- For each cluster, create a web page consisting of links to the documents in the cluster, and present it to the webmaster for evaluation.

**1. Process the access log into visits.**
As defined above, a visit is
an ordered sequence of pages accessed by a single user in a single session.
An access log, however, is a sequence of page views, or requests made to the
web server.^{}
Each request typically includes the time of the request,
the URL requested, and the machine from which the request originated.
For our purposes, however, we need to extract discrete visits from the log.
We first assume that each
originating machine corresponds to a single visitor.^{}
A series of page views in a day's log from one visitor,^{}
ordered by their time-stamps, corresponds
to a single session for that visitor.

**2. Compute the co-occurrence frequencies between pages
and create a similarity matrix.**
For each pair of pages *p _{1}* and

We use the minimum of the two conditional probabilities to avoid mistaking
an asymmetrical relationship for a true case of similarity.
For example, a popular page *p _{1}* might be on the most common path to
a more obscure page

As stated above, our goal is to find clusters of related
*but currently unlinked* pages. Therefore, we wish to avoid
finding clusters of pages that are already linked together. We
prevent this by setting the matrix cell for two pages to zero
if they are already linked in the site.

We observed that the similarity matrix could be viewed as a graph, which enables us to apply graph algorithms to the task of identifying collections of related pages. However, a graph corresponding to the similarity matrix would be completely (or almost completely) connected. In order to reduce noise, we apply a threshold and remove edges corresponding to low co-occurrence frequency.

Rank | Topic | Coherent? | Useful? |

1 | Images of Korg keyboards | yes | somewhat |

2 | Images of Korg keyboards | somewhat | somewhat |

3 | Sequencers | yes | yes |

4 | Sequencers | yes | yes |

5 | Roland Instruments | no | no |

6 | Samplers | yes | yes |

7 | Instrument samples | yes | somewhat |

8 | Samplers | yes | yes |

9 | Instrument samples | yes | somewhat |

10 | Books, do-it-yourself info | somewhat | yes |

**3. Create the graph corresponding to the matrix, and find maximal cliques
(or connected components) in the graph.**
We create a graph in which each page is a node and each nonzero cell in the
matrix is an arc.
Next, we apply graph algorithms that efficiently extract connectivity
information from the graph (e.g., the linear-time algorithm for identifying
connected components). The frequency information on the arcs is
ignored in this step for the sake of efficiency.
By creating a sparse graph, and using efficient graph algorithms
for cluster mining, we can identify high quality clusters
substantially faster than by relying on traditional
clustering methods [15].
We may extract two kinds of subgraphs, leading to two variants
of the PageGather algorithm.

*PG*_{CLIQUE}finds all*maximal cliques*in the graph. A clique is a subgraph in which every pair of nodes has an edge between them; a maximal clique is not a subset of any larger clique. For efficiency, we bound the size of discovered cliques.^{}*PG*_{CC}finds all*connected components*-- subgraphs in which every pair of pages has a path of edges between them.

**4. Rank the clusters found, and choose which to output.**
Step (3) may find many clusters, but we may wish to
output only a few. For example, the site's webmaster might want to
see no more than a handful of clusters every week and decide which
to turn into new index pages. Accordingly, all clusters found are
rated and sorted by averaging the co-occurrence frequency between
all pairs of documents in the cluster.
We find that *PG*_{CLIQUE} tends to discover many similar clusters. Therefore,
we have developed two ways of reducing the number of similar clusters
in the final results.

**Overlap reduction**proceeds through the ranked cluster list and removes any cluster that overlaps highly with a previously seen (i.e. better) cluster.**Merging**walks the ranked list of clusters and, whenever a cluster has sufficient overlap with a previously seen cluster, merges the two and continues.

However we process the ranked list of clusters, we return
a bounded number of results, and apply a quality threshold -
sometimes returning no results at all. The webmaster specifies
this bound -- e.g. ``give me at most ten clusters above quality
threshold *X*...''.

**5. For each cluster found, create a web page consisting of links
to the documents in the cluster, and present it to the
webmaster for evaluation.**
The PageGather algorithm finds candidate link sets and presents them
to the webmaster as in figure 2. The webmaster is prompted to
accept or reject the cluster, to name it, and remove any links she
thinks inappropriate. Links are labeled with the titles of the
target pages and ordered alphabetically by those titles. The webmaster
is responsible for placing the new page at the site.