We have described a method for learning general, page-independent heuristics for extracting data from (``wrapping'') HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general, page-independent heuristic procedure for extracting data. Page formats that are correctly ``wrapped'' by the learned heuristics can be incorporated into a knowledge base with minimal human effort; it is only necessary to indicate where in the knowledge base the extracted information should be stored. In contrast, other wrapper-induction methods require a human teacher to train them on each new page format.
More specifically, we defined two common types of extraction problems--extraction of simple lists and simple hotlists--which together comprise nearly 75% of the wrappers required in experiments with WHIRL. We showed that learning these types of wrappers can be reduced to a classification problem. Using this reduction and standard off-the-shelf learners, we were able to learn extraction heuristics that worked perfectly on about 30% of the problems in a large benchmark collection, and that worked well on about 50% of the problems. The extraction heuristics were also demonstrated to be domain-independent. Finally, we showed that page-independent extraction heuristics are complementary to more traditional methods for learning wrappers, and that a simple combination of these methods can substantially improve the performance of a page-specific wrapper learning method.
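The reduction above can be illustrated with a minimal sketch. This is not the paper's actual system: the feature set, the training labels, and the one-feature "stump" learner below are all hypothetical stand-ins for the richer features and off-the-shelf learners used in the experiments. The idea is only that, given a page with a known wrapper, each text node's tag path yields a labeled example, and a classifier learned from such examples can be applied to nodes of unseen pages.

```python
# Illustrative sketch (not the paper's system): reducing list extraction
# to binary classification over HTML tag paths. Features, labels, and
# the learner are hypothetical simplifications.

from collections import Counter

def features(tag_path):
    """Map a tag path like ('html','body','ul','li') to a feature dict."""
    return {
        "depth": len(tag_path),
        "last_tag": tag_path[-1],
        "in_list": int(any(t in ("ul", "ol") for t in tag_path)),
    }

def train_stump(examples):
    """Learn a one-feature rule: predict positive iff last_tag equals
    the tag most common among positively labeled examples."""
    pos = Counter(f["last_tag"] for f, y in examples if y == 1)
    best = pos.most_common(1)[0][0] if pos else None
    return lambda f: int(f["last_tag"] == best)

# Training data derived from a page with a working wrapper: the tag
# path of each text node, labeled 1 iff the wrapper extracts that node.
train = [
    (features(("html", "body", "ul", "li")), 1),
    (features(("html", "body", "ul", "li")), 1),
    (features(("html", "body", "h1")), 0),
    (features(("html", "body", "p")), 0),
]
classify = train_stump(train)

# The learned rule is page-independent: apply it to an unseen page.
print(classify(features(("html", "body", "div", "ul", "li"))))  # → 1
print(classify(features(("html", "body", "p"))))                # → 0
```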
Ashish and Knoblock  propose heuristics for detecting hierarchical structure in an HTML document; obtaining this structure can facilitate programming a wrapper. These heuristics were designed manually, rather than acquired by learning as in our system.
One of the authors of this paper has also evaluated certain hand-coded heuristic methods for detecting lists and hotlists in HTML pages . Briefly, this work considers a very restricted class of possible wrapper programs--a class chosen by careful analysis of the 84 extraction problems used in this study. Given an HTML page, it is possible to enumerate all wrapper programs in the restricted class, and then rank the enumerated wrappers according to various heuristic measures. Cohen's results show that some natural ranking heuristics perform relatively poorly on this task: for instance, the wrapper that extracts the longest list is correct only 18% of the time. However, more complex heuristics, encoded in a special-purpose logic, perform as well as or better than the learning approach discussed here. An advantage of the learning approach (relative to the ranking approach) is that the learned heuristics are obtained without manual engineering, and hence can be more readily adapted to variations of the extraction problem. The learning approach is also applicable to a broader class of extraction programs.
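The enumerate-and-rank scheme can be sketched as follows. This is a toy illustration, not Cohen's special-purpose logic: the page representation and candidate class (one wrapper per distinct tag path) are assumptions made for brevity, and only the weak "longest list" heuristic mentioned above is shown.

```python
# Hedged sketch of enumerate-and-rank: each distinct tag path on a page
# is a candidate wrapper, and candidates are ranked by the (weak)
# "longest list" heuristic. The page data below is hypothetical.

from collections import Counter

# A page represented as (tag path, text) pairs for its text nodes.
nodes = [
    (("body", "ul", "li"), "Item A"),
    (("body", "ul", "li"), "Item B"),
    (("body", "ul", "li"), "Item C"),
    (("body", "table", "tr", "td"), "Cell 1"),
    (("body", "table", "tr", "td"), "Cell 2"),
]

# Enumerate candidates: a wrapper's extraction is the set of nodes
# sharing its tag path, so counting paths gives each wrapper's yield.
candidates = Counter(path for path, _ in nodes)

# Rank by the "longest list" heuristic and pick the top candidate.
best_path, size = candidates.most_common(1)[0]
extracted = [text for path, text in nodes if path == best_path]
print(best_path, extracted)  # → ('body', 'ul', 'li') ['Item A', 'Item B', 'Item C']
```

As the results above indicate, this particular ranking heuristic is often wrong in practice; the point of the sketch is only the structure of the method: enumerate a restricted wrapper class, score each candidate, and extract with the top-ranked wrapper.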
Earlier work on automatic extraction of data from documents includes heuristic methods for recognizing tables given a bitmap-level representation of a document . These methods rely on finding rows and columns of white space in the bitmap, and would seem to be relatively expensive to apply directly to HTML documents. Our method also has the advantage that it can extract information not clearly visible to the human reader, such as URLs that appear only as attribute values in the HTML. The heuristics devised by Rus and Subramanian were also designed manually, rather than acquired by learning.
The results of this paper raise a number of questions for further research. The breadth of the system could be improved by devising labeling schemes for other sorts of extraction tasks. Performance on lists and hotlists could perhaps be improved by using more powerful learning methods, or learning methods that exploit more knowledge about this specific learning task. Finally, it might be useful to couple these automatic extraction methods with methods that automatically determine if a Web page contains a list or hotlist, and methods that automatically associate types with the extracted data. This combination of methods could further lower the cost of fielding an information integration system for a new domain.