In the leave-one-page-out experiment, when extraction performance is tested on a page p_i, the learned extraction program has no knowledge whatsoever of p_i itself. However, it may well be the case that the learner has seen examples of only slightly different pages--for instance, pages from the same Web site, or Web pages from different sites that present similar information. So it is still possible that the learned extraction heuristics are to some extent specialized to the benchmark problems from which they were generated, and would work poorly in a novel application domain.
We explored this issue in several ways. The 84 benchmark problems we consider were taken from four different demonstrations of WHIRL: one integrating information on North American birds (birds), one concerning computer games for children (games), one concerning movies and movie reviews (movies), and one concerning news stories and company information (news). In the middle section of Table , we give the performance in the leave-one-page-out experiments in each individual domain. Performance seems to be roughly comparable7 on all domains, a first indication that the learned extraction heuristics are not highly domain-specific.
We also explored this issue by conducting two variants of the leave-one-page-out experiment. The first variant is a ``leave-one-domain-out'' experiment. Here we group the pages by domain, and for each domain, test the performance of the extraction heuristics obtained by training on the other three domains. If the extraction heuristics were domain-specific, then one would expect to see markedly worse performance; in fact, the performance degrades only slightly. (Note also that less training data is available in the ``leave-one-domain-out'' experiments, another possible cause of degraded performance.) These results are shown in the leftmost section of Table .
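The leave-one-domain-out protocol can be sketched as a simple data split. The code below is a minimal illustration, not the authors' implementation: the function name `leave_one_domain_out` and the representation of pages as `(page_id, domain)` pairs are assumptions made for the example, and the learning and scoring steps are left out.

```python
from collections import defaultdict


def leave_one_domain_out(pages):
    """Yield (held_out_domain, train_pages, test_pages) for each domain.

    `pages` is a list of (page_id, domain) pairs. This representation is
    hypothetical and chosen only for illustration.
    """
    # Group page identifiers by the domain they belong to.
    by_domain = defaultdict(list)
    for page_id, domain in pages:
        by_domain[domain].append(page_id)

    domains = sorted(by_domain)
    for held_out in domains:
        # Test on every page of the held-out domain.
        test_pages = list(by_domain[held_out])
        # Train on the pages of all remaining domains.
        train_pages = [
            page_id
            for domain in domains
            if domain != held_out
            for page_id in by_domain[domain]
        ]
        yield held_out, train_pages, test_pages
```

Each split trains on fewer pages than the corresponding leave-one-page-out split, which is the confound noted in the parenthetical remark above.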
The second variant is presented in the rightmost section of Table , labeled as the ``intra-domain leave-one-page-out'' experiment. Here we again group the pages by domain, and perform a separate leave-one-page-out experiment for each domain. Thus, in this experiment the extraction heuristics tested for page p_i are learned from only the most similar pages--the pages from the same domain. In this variant, one would expect a marked improvement in performance if the learned extraction heuristics were very domain- or site-specific. In fact, there is little change. These experiments thus support the conjecture that the learned extraction heuristics are in fact quite general.
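The intra-domain variant differs only in how the training set is chosen. The following is a minimal sketch under the same assumptions as before: pages are hypothetical `(page_id, domain)` pairs, the function name `intra_domain_leave_one_page_out` is invented for the example, and training and evaluation are omitted.

```python
from collections import defaultdict


def intra_domain_leave_one_page_out(pages):
    """Yield (domain, train_pages, held_out_page) for every page.

    `pages` is a list of (page_id, domain) pairs. This representation is
    hypothetical and chosen only for illustration.
    """
    # Group page identifiers by the domain they belong to.
    by_domain = defaultdict(list)
    for page_id, domain in pages:
        by_domain[domain].append(page_id)

    for domain in sorted(by_domain):
        members = by_domain[domain]
        for i, held_out in enumerate(members):
            # Train only on the other pages from the same domain.
            train_pages = members[:i] + members[i + 1:]
            yield domain, train_pages, held_out
```

In this sketch a domain containing a single page yields an empty training set, so a real experiment would need each domain to contribute at least two pages.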
We also explored using classification learners other than RIPPER. Table shows the results for the same set of experiments using CART, a widely used decision tree learner.8 CART achieves performance generally comparable to RIPPER. We also explored using C4.5 and an implementation of Naive Bayes; however, preliminary experiments suggested that their performance was somewhat worse than that of both RIPPER and CART.