Page synthesis is the automatic creation of web pages. An
index page is a page consisting of links to a set of pages that cover
a particular topic (e.g., electric guitars). Given this terminology,
we define the index page synthesis problem: given a web site and
a visitor access log, create new index pages containing collections of
links to related but currently unlinked pages.
An access log is a document containing one entry
for each page requested from the web server. Each request lists at
least the origin (IP address) of the request, the URL requested, and
the time of the request. Related but unlinked pages are pages
that share a common topic but are not currently linked at the site;
two pages are considered linked if there exists a link from one to the
other or if there exists a page that links to both of them.
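The two definitions above can be made concrete with a minimal sketch. It assumes the site's link structure is available as a mapping from each page URL to the set of URLs it links to; that representation, the names `AccessLogEntry`, `are_linked`, and `unlinked_pairs`, and the example URLs are all hypothetical and are not taken from the paper. The sketch covers only the "linked" half of the definition; deciding whether two pages share a common topic is a separate question.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class AccessLogEntry:
    """One request, with the minimum fields the text says each entry lists."""
    origin: str      # IP address the request came from
    url: str         # URL requested
    time: datetime   # time of the request


def are_linked(a, b, links):
    """Return True if pages a and b count as linked.

    `links` is an assumed representation: page URL -> set of URLs it links to.
    """
    # Case 1: a link from one page to the other, in either direction.
    if b in links.get(a, set()) or a in links.get(b, set()):
        return True
    # Case 2: some page links to both of them.
    return any(a in out and b in out for out in links.values())


def unlinked_pairs(pages, links):
    """List page pairs that are not linked (candidates only, topic not checked)."""
    return [
        (a, b)
        for a, b in combinations(sorted(pages), 2)
        if not are_linked(a, b, links)
    ]


# Hypothetical three-page site plus an index page.
links = {
    "/index.html": {"/guitars/a.html", "/guitars/b.html"},
    "/guitars/a.html": {"/amps/c.html"},
}
pages = {"/guitars/a.html", "/guitars/b.html", "/amps/c.html"}
candidates = unlinked_pairs(pages, links)

# A sample log entry with made-up values.
entry = AccessLogEntry("192.0.2.1", "/guitars/a.html", datetime(1999, 3, 2, 12, 0))
```

Here `/guitars/a.html` and `/guitars/b.html` are linked because `/index.html` links to both, while `/guitars/b.html` and `/amps/c.html` are not linked by either rule, so that pair is the only one left in `candidates`.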
The problem of synthesizing a new index page can be decomposed into
several subproblems.
1. What are the contents (i.e., hyperlinks) of the index page?
2. How are the hyperlinks on the page ordered?
3. How are the hyperlinks labeled?
4. What is the title of the page? Does it correspond to a coherent concept?
5. Is it appropriate to add the page to the site? If so, where?
In this paper, we focus on the first subproblem -- generating the
contents of the new web page. The remaining subproblems are
topics for future work. We note that several subproblems,
particularly the last one, are quite difficult and will be solved in
collaboration with the site's human webmaster. Nevertheless, we show
that the task of generating candidate index page contents can be
automated with some success using the PageGather algorithm described below.
Mike Perkowitz
1999-03-02