The limitations of our quality measure suggest a natural direction for future work. So far, we have focused on finding clusters of frequently co-occurring pages. While this approach can produce useful index pages, it does not address issues of purity or completeness. For example, when presented with a page titled ``electric guitars,'' the typical user would expect the set of links provided to be pure -- containing only links to guitars -- and complete -- containing links to all guitars at the site. Purity and completeness are analogous, respectively, to the criteria of precision and recall from information retrieval. IR systems are often evaluated in terms of their precision and recall with respect to a labeled data collection; human judges decide which objects match a particular query, and the system is rated on how closely its results accord with the human judges. Precision and recall may be a better metric for evaluating PageGather, but would require hand labeling of many examples -- for each cluster found, we would have to judge what topic the cluster corresponds to and what set of pages actually belong in the cluster.
Instead, our goal is to extend our current approach to automatically identify topics and find pure and complete sets of relevant pages. We plan to use the candidate link sets generated by PageGather as a starting point, mapping them to the closest pure and complete topics. There are two ways to make the notion of a ``topic'' available to PageGather. First, if we have an extensional definition of each potential topic at the site as a set of links, then identifying the topic closest to a PageGather-generated link set is straightforward. Alternatively, if we have a predicate language for describing the different pages at the site (in XML or a-la-STRUDEL ), then we can apply concept learning techniques  to generate a description of the PageGather link set in that language. We would view the PageGather links as positive examples of the target concept, and links outside the set as negative examples, and apply a noise-tolerant symbolic learning technique (such as decision trees) to output a description of the topic most closely matching the PageGather link set.
This mapping from candidate link sets to topics is likely to decrease our statistical measure, but we realize that the measure is only a rough approximation of ``true'' index page quality. We plan to investigate alternative measures and also to carry out user studies (with both Web site visitors and webmasters) to assess the impact of the suggested adaptations on users in practice.
Index pages also have at least two common uses: as a summary of information on a particular topic (useful to a visitor wanting to get an overview of that topic) and as a directory of specific resources (useful to a visitor with a specific goal). These different uses have different requirements, which may conflict. For example, for the first usage, a few of the most important links may be sufficient -- better to make sure all links are topical than to include something irrelevant. For the second, a complete listing is essential, even if some of the included links are only marginally relevant. We have focused on the first usage, but a good index should be able to support both.
Finally, our work thus far has focused on a single web site for convenience. We plan to test our approach on additional web sites, including our department's web site, in the near future.