A number of recent systems operate by taking information from the Web, storing it in some sort of knowledge base, and then allowing a user to query that knowledge base [14,7,11,8,13,15,19]. One bottleneck in building such an information integration system is developing ``wrappers''--programs that convert Web pages into an appropriate format for the knowledge base. Because data can be presented in many different ways on the Web, and Web pages frequently change format, building and maintaining these wrappers is often time-consuming and tedious.
A number of proposals have been made for reducing the cost of building wrappers. Data exchange standards like XML have promise; unfortunately, XML is not yet widely used, and one might expect that Web information sources using ``legacy'' formats like HTML will be common for some time. Some researchers have proposed special languages for writing wrappers [9,5], or semi-automated tools for wrapper construction . Others have implemented systems that allow wrappers to be trained from examples [12,10,16]. Although languages and learning methods for wrapper construction are useful, they do not entirely eliminate the human effort involved in ``wrapping'' a Web site; for example, if learning is used, it is still necessary for a human to label the examples given to the learner.
Most of the data extracted by wrappers is originally encoded in HTML, and is very regular and repetitive in format: generally, the pages being wrapped are well-structured tables and lists. An alternative research program would be to develop general, page-independent, heuristics for recognizing (and extracting data from) tables and lists in HTML documents. However, developing general-purpose, reliable, heuristics for table- and list-recognition is non-trivial, as users often do not implement tables and lists using the appropriate HTML constructs (e.g., <table>, <ul>, <dl>). Furthermore, any such heuristics might well require substantial effort to maintain as HTML and conventions for using it continue to evolve.
In this paper, we seek to learn general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general, page-independent procedure for extracting data--a procedure that works for many formats and many pages. New pages that are correctly wrapped by the learned procedure can be incorporated into a knowledge base with minimal human effort; it is only necessary to indicate where in the knowledge base the extracted information should be stored. Our method thus differs from earlier methods for learning wrappers, in which the goal of learning was a wrapper for pages with a single specific format, and a new training process is needed for each page format.
Below we will describe our learning method, and evaluate it on a collection of 84 extraction problems encountered in building applications for the information integration system WHIRL [4,5]. We first identify two types of extraction problems, namely extraction of simple lists and simple hotlists; these were the most common types of extraction problems for WHIRL, together comprising about 75% of the implemented wrappers. We then explain how this extraction problem can be reduced to a more conventional classification problem, thus allowing existing learners to be used to learn extraction heuristics. We demonstrate that around 30% of the benchmark problems can be handled perfectly by learned extraction heuristics, and around 50% of the benchmarks can be handled reasonably well. We also show that the learned heuristics are domain-independent.
We also evaluate a hybrid system that combines learned page-independent extraction heuristics with a more conventional wrapper-learning approach, in which the learner is re-trained for each page format. We show that incorporating page-independent heuristics leads to improved performance: for instance, the hybrid system gets acceptable performance on 80% of the benchmarks after seeing only 6 training examples, where as the conventional system requires 12 training examples to do as well.