In this section, we introduce PageGather. Given a large access log, our task is to find collections of pages that tend to co-occur in visits. Clustering (see [22,16,24]) is a natural technique to consider for this task. In clustering, documents are represented in an N-dimensional space (for example, as word vectors). Roughly, a cluster is a collection of documents close to each other and relatively distant from other clusters. Standard clustering algorithms partition the documents into a set of mutually exclusive clusters.
Cluster mining is a variation on traditional clustering that is well suited to our task. Instead of attempting to partition the entire space of documents, we try to find a small number of high quality clusters. Furthermore, whereas traditional clustering is concerned with placing each document in exactly one cluster, cluster mining may place a single document in multiple overlapping clusters. The relationship between traditional clustering and cluster mining is parallel to that between classification and data mining as described in . Segal contrasts mining ``nuggets'' -- finding high-accuracy rules that capture patterns in the data -- with traditional classification -- classifying all examples as positive or negative -- and shows that traditional classification algorithms do not make the best mining algorithms.
The PageGather algorithm uses cluster mining to find collections of related pages at a web site, relying on the visit-coherence assumption. In essence, PageGather takes a web server access log as input and maps it into a form ready for clustering; it then applies cluster mining to the data and produces candidate index-page contents as output. The algorithm has five basic steps:
1. Process the access log into visits. As defined above, a visit is an ordered sequence of pages accessed by a single user in a single session. An access log, however, is a sequence of page views, or requests made to the web server. Each request typically includes the time of the request, the URL requested, and the machine from which the request originated. For our purposes, however, we need to extract discrete visits from the log. We first assume that each originating machine corresponds to a single visitor. A series of page views in a day's log from one visitor, ordered by their time-stamps, corresponds to a single session for that visitor.
2. Compute the co-occurrence frequencies between pages and create a similarity matrix. For each pair of pages p1 and p2, we compute P(p1|p2), the probability of a visitor visiting p1 if she has already visited p2 and P(p2|p1), the probability of a visitor visiting p2 if she has already visited p1. The co-occurrence frequency between p1 and p2 is the minimum of these values.
We use the minimum of the two conditional probabilities to avoid mistaking an asymmetrical relationship for a true case of similarity. For example, a popular page p1 might be on the most common path to a more obscure page p2. In such a case P(p1|p2) will be high, perhaps leading us to think the pages similar. However, P(p2|p1) could be quite low, as p1 is on the path to many pages and p2 is relatively obscure.
As stated above, our goal is to find clusters of related but currently unlinked pages. Therefore, we wish to avoid finding clusters of pages that are already linked together. We prevent this by setting the matrix cell for two pages to zero if they are already linked in the site.
We observed that the similarity matrix could be viewed as a graph, which enables us to apply graph algorithms to the task of identifying collections of related pages. However, a graph corresponding to the similarity matrix would be completely (or almost completely) connected. In order to reduce noise, we apply a threshold and remove edges corresponding to low co-occurrence frequency.
|1||Images of Korg keyboards||yes||somewhat|
|2||Images of Korg keyboards||somewhat||somewhat|
|10||Books, do-it-yourself info||somewhat||yes|
3. Create the graph corresponding to the matrix, and find maximal cliques (or connected components) in the graph. We create a graph in which each page is a node and each nonzero cell in the matrix is an arc. Next, we apply graph algorithms that efficiently extract connectivity information from the graph (e.g., the linear-time algorithm for identifying connected components). The frequency information on the arcs is ignored in this step for the sake of efficiency. By creating a sparse graph, and using efficient graph algorithms for cluster mining, we can identify high quality clusters substantially faster than by relying on traditional clustering methods . We may extract two kinds of subgraphs, leading to two variants of the PageGather algorithm.
4. Rank the clusters found, and choose which to output. Step (3) may find many clusters, but we may wish to output only a few. For example, the site's webmaster might want to see no more than a handful of clusters every week and decide which to turn into new index pages. Accordingly, all clusters found are rated and sorted by averaging the co-occurrence frequency between all pairs of documents in the cluster. We find that PGCLIQUE tends to discover many similar clusters. Therefore, we have developed two ways of reducing the number of similar clusters in the final results.
However we process the ranked list of clusters, we return a bounded number of results, and apply a quality threshold - sometimes returning no results at all. The webmaster specifies this bound -- e.g. ``give me at most ten clusters above quality threshold X...''.
5. For each cluster found, create a web page consisting of links to the documents in the cluster, and present it to the webmaster for evaluation. The PageGather algorithm finds candidate link sets and presents them to the webmaster as in figure 2. The webmaster is prompted to accept or reject the cluster, to name it, and remove any links she thinks inappropriate. Links are labeled with the titles of the target pages and ordered alphabetically by those titles. The webmaster is responsible for placing the new page at the site.