Designing a rich web site so that it readily yields its information can be tricky. Unlike the proverbial oyster that contains a single pearl, a web site often contains myriad facts, images, and hyperlinks. Many different visitors approach a popular web site -- each with his or her own goals and concerns. Consider, for example, the web site for a typical computer science department. The site contains an amalgam of research project descriptions, course information, lists of graduating students, pointers to industrial affiliates, and much more. Each nugget of information is of value to someone who would like to access it readily. One might think that a well-organized hierarchy would solve this problem, but we've all had the experience of banging our heads against a web site and crying out ``it's got to be here somewhere...''.
The problem of good web design is compounded by several factors. First, different visitors have distinct goals. Second, the same visitor may seek different information at different times. Third, many sites outgrow their original design, accumulating links and pages in unlikely places. Fourth, a site may be designed for a particular kind of use, but be used in many different ways in practice; the designer's a priori expectations may be violated. Too often web site designs are fossils cast in HTML, while web navigation is dynamic, time-dependent, and idiosyncratic. In previous work, we challenged the AI community to address this problem by creating adaptive web sites: sites that semi-automatically improve their organization and presentation by learning from visitor access patterns.
Many web sites can be viewed as user interfaces to complex information stores. However, in contrast to standard user interfaces, where data on user behavior has to be gathered in expensive (and artificial) focus groups and usability labs, web server logs automatically record user behavior at the site. We posit that adaptive web sites could become a valuable method of mining this data with the goal of continually tuning the site to its user population's needs.
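To make the raw material concrete, the following minimal sketch groups web server log entries into per-visitor page-access lists, the kind of data an adaptive site would mine. The Common Log Format field layout is standard, but the session heuristic used here (grouping all requests from the same host) and the function name are illustrative assumptions, not the method evaluated in this paper.

```python
import re
from collections import defaultdict

# Matches Common Log Format GET requests:
# host ident user [timestamp] "GET path protocol" status bytes
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+) [^"]*" \d+ \S+')

def visits_from_log(lines):
    """Map each client host to the ordered list of pages it requested.

    Grouping by host is a crude session heuristic, used here only
    for illustration.
    """
    visits = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            host, page = m.groups()
            visits[host].append(page)
    return dict(visits)

sample = [
    '1.2.3.4 - - [01/Jan/1999:10:00:00 -0800] "GET /courses.html HTTP/1.0" 200 1043',
    '1.2.3.4 - - [01/Jan/1999:10:01:00 -0800] "GET /research.html HTTP/1.0" 200 2217',
    '5.6.7.8 - - [01/Jan/1999:10:02:00 -0800] "GET /courses.html HTTP/1.0" 200 1043',
]
print(visits_from_log(sample))
```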
While adaptive web sites are potentially valuable, their feasibility is unclear a priori: Can non-trivial adaptations be automated? Will adaptive web sites run amok, yielding chaos rather than improvement? What is an appropriate division of labor between the automated system and the human webmaster? To investigate these issues empirically, we analyze the problem of index page synthesis. We focus on one subproblem -- generating candidate link sets to include in index pages -- which is amenable to automation, and describe the PageGather algorithm, which solves it.
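One building block behind this kind of usage mining is counting how often pairs of pages co-occur in visitor sessions; pages that are frequently visited together are natural candidates to link from a common index page. The sketch below is only that building block under illustrative assumptions (visits given as page lists, raw pair counts as the statistic); it is not the PageGather algorithm itself, which is presented later.

```python
from itertools import combinations
from collections import Counter

def cooccurrence_counts(visits):
    """Count, for each unordered page pair, the number of visits
    containing both pages (each visit contributes at most once per pair)."""
    counts = Counter()
    for pages in visits:
        # Deduplicate within a visit so repeat views don't inflate counts.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return counts

visits = [
    ["/courses.html", "/research.html", "/people.html"],
    ["/courses.html", "/research.html"],
    ["/people.html", "/courses.html"],
]
print(cooccurrence_counts(visits).most_common(2))
```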
The remainder of this paper is organized as follows. We next discuss the design space of adaptive web sites and present previous work in this area. We then present design desiderata which motivate our own approach. In section 2, we define the index page synthesis problem, the focus of our case study. We then present PageGather, analyzing variants of the algorithm, as well as both data mining and clustering algorithms, as potential solutions. In section 3, we experimentally evaluate variants of PageGather and compare its performance to that of Apriori, the classical data mining algorithm for the discovery of frequent sets. We also compare PageGather's output to pre-existing, human-authored index pages available at our experimental web site. We conclude with a discussion of future work and a summary of our contributions.