In this section, we report on experiments designed to test the effectiveness of our approach. We first compare the performance of several variants of our algorithm. We then compare PageGather with the Apriori frequent set algorithm. Finally, we compare PageGather's candidate clusters with human-authored index pages.
Our experiments draw on data collected from the Music Machines web site, a site devoted to information about many kinds of electronic musical instruments. The site contains approximately 2500 distinct documents, including HTML pages, plain text, images, and audio samples. Music machines receives approximately 10,000 hits per day from roughly 1200 distinct visitors. In our experiments, the training data is a collection of access logs for six months; the test data is a set of logs from a subsequent one-month period.
We compare the algorithms in terms of the quality of the candidate index pages they produce. Measuring cluster quality is a notoriously difficult problem. To measure the quality of a cluster as an index page candidate, we need some measure of whether the cluster captures a set of pages that are viewed by users in the same visit. If so, then grouping them together on an index page would save the user the trouble of traversing the site to find them. As an approximate measure we ask: if a user visits any one page in the cluster, what percentage of pages in the cluster is she likely to visit overall? More formally, let V be the set of user visits to the site and Vc be the set of that include at least one page from cluster c. For a particular visit , the number of pages visited in c is . The average number of hits to a cluster, over all visits, is . Finally, the average percentage is this average divided by the size of c: . In each experiment, each algorithm outputs a ranked list of clusters. In all graphs, these clusters are sorted by average visit percentage for the sake of clarity.
As described in section 2.2, the two PageGather variants PGCLIQUE and PGCC differ in how clusters are found in the graph representation. PGCLIQUE finds all maximal cliques; PGCC finds all connected components in the graph. In our first experiment, we compare these two variants of the algorithm, shown in figure 4. Note that the clusters found by PGCLIQUE apparently perform much better than those found by PGCC. When we examine the clusters closely, however, we find the top clusters found by PGCLIQUE to be very similar to each other -- PGCLIQUE tends to produce many slight variations on the same basic cluster. The reason for this is that there may exist a well-connected set of pages that does not, however, form a clique. Many subsets, however, are cliques, and the algorithm returns these many similar subsets as clusters. If we decrease the co-occurrence threshold, allowing more pages to be connected by edges in our graph representation, one such set of pages may become a clique. However, there will generally always be some borderline sets that exhibit this behavior. This observation was the motivation for introducing overlap reduction and merging as described in section 2.2. In figure 5, we compare PGCC with PGCLIQUE using overlap reduction. Note that PGCC is unaffected by the reduction step, as connected components of a graph never overlap. We see that the two versions of the algorithm now perform comparably. However, PGCLIQUE has a slight edge, and we will use it for our remaining comparisons.
Earlier, we discussed three ways of processing the ranked list of clusters found by PageGather's clustering component. We may simply return the raw ranked list, we may eliminate overlapping clusters, or we may merge similar clusters together. We have already discussed why the raw list is inadequate. In figure 6, we compare the performance of overlap reduction and merging applied to PGCLIQUE. We might expect the merged clusters to show lower quality scores, as this operation brings together documents that were not associated by PageGather's clustering component, and this is apparent in the performance comparison. On the other hand, the best merged cluster is about three times larger (containing 19 links) than the best from the reduced set (7 links). The smaller cluster is a subset of the larger, which is arguably a more complete cluster; see the discussion in section 4.
In section 2.4, we discussed how the Apriori frequent set algorithm can be applied to the problem of index page synthesis. We apply Apriori to the collection of path data to find the most frequently occurring sets of pages; the most frequently occurring sets are our candidate clusters and are compared to the output of PGCLIQUE. We now compare the performance of Apriori with PGCLIQUE (see figure 7). Initially, we evaluated the raw output of Apriori. However, as with the raw version of PGCLIQUE, the top clusters are all variations of each other. We therefore also show the results of adding a reduction step to Apriori as well; surprisingly, of the thousands of sets found by Apriori, only two prove to be sufficiently distinct. In either case, we observe that PGCLIQUE's clusters score much higher than Apriori's.
It is natural to ask how good these clusters really are, and how high an average visit percentage is high enough. To attempt to answer these questions we look to the ``ideal'' case: index pages created by a human webmaster. Music Machines contains many index pages of different kinds. Expecting all of the site's index pages to score significantly better than PageGather's output, we chose index pages pertaining to four representative topics: (1) instruments produced by a particular manufacturer (Roland); (2) documents pertaining to a particular instrument (the Roland Juno keyboard); (3) all instruments of a particular type (drum machines); and (4) all files of a particular type (audio samples). The set of outgoing links on each page (excluding standard navigation links) was treated as a ``cluster'' and evaluated on our test data. Figure 8 shows the comparison of these clusters to the output of PGCLIQUE with overlap reduction. We believe there are two reasons why PGCLIQUE's clusters perform so much better than the existing human-authored index pages. First, we believe that our cluster mining approach finds genuine regularities in the access logs that carry over from training data to test data. PageGather is geared toward finding these regularities and is apparently successful. a high proportion of links when there are so many on the page. Second, although creating index pages with high click-through rates is desirable, the webmaster of a site has other considerations in mind. For example, a page titled ``drum machines'' should be complete -- that is, it should contain all drum machines at the site, not just those that often co-occur in the logs. Similarly, such a page should be pure -- it should not contain links to guitars, whatever the statistical regularity. These considerations are both motivated by the expectations of human visitors coming to the site, and are not accounted for by our measure. Therefore, while high average visit percentages are promising, they do not tell the whole story.