As noted in the introduction, in previous research on learning wrappers, a new wrapper has been trained for each new page format, using examples specific to that page. One possible way of training such a wrapper-induction system is the following. The user first labels the first few items that should be extracted from the list, starting from the top of the page. These are assumed to be a complete list of items to be extracted up to this point; that is, it is assumed that any unmarked text preceding the last marked item should not be extracted. The learning system then learns a wrapper from these examples, and uses it to extract data from the remainder of the page. The learned wrapper can also be used for other pages with the same format as the page used in training.
For simple lists and hotlists, the approach outlined above in Section can also be used to learn page-specific wrappers. The only difference is in the way that a dataset is constructed, and the circumstances in which the learned wrapper is used; in learning a page-specific wrapper, all training examples come from the page p being wrapped, and the learned classifier is only used to label parse tree nodes from p (or other pages with the same format as p).
We performed the following experiment to test the effectiveness of such a learning system. For a wrapper/page pair (w_i, p_i), we used w_i and p_i to build a labeled parse tree. Then, we traversed the tree in a left-to-right, depth-first order, collecting all nodes up to and including the K-th node with a positive label into a sample S_{i,K}; this simulates asking the user to label the first few sections of text to be extracted. We then trained a wrapper in the usual way on the sample S_{i,K}, and tested its performance on the remaining nodes. We repeated this procedure for all 84 wrapper/page pairs, and for K = 1, ..., 20.
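The sampling step above can be sketched as follows. The `Node` type and its fields are hypothetical stand-ins for whatever parse-tree representation the learner actually uses; only the traversal order and stopping rule come from the procedure described here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    # Hypothetical parse-tree node: `label` is True if this node's text
    # should be extracted; `children` appear in left-to-right page order.
    label: bool
    children: List["Node"] = field(default_factory=list)

def collect_sample(root: Node, k: int) -> List[Node]:
    """Collect all nodes, in left-to-right depth-first order, up to and
    including the k-th positively labeled node -- simulating a user who
    marks the first k items to be extracted from the top of the page."""
    sample, positives = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        sample.append(node)
        if node.label:
            positives += 1
            if positives == k:
                break
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return sample
```

Training on `collect_sample(root, k)` and testing on the remaining nodes then reproduces one cell of the experiment, for one page and one value of K.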
The results are shown in Figure . In each plot, we place a mark at row K and column i if the tested performance of the classifier trained on problem i with K positive examples exceeds some performance threshold; in the bottom plot, a mark is placed if performance is perfect, and in the top plot, a mark is placed if performance is ε-good for ε = 5%. Also, a vertical line is placed at column i if the corresponding page-independent classifier exceeds the performance threshold; that is, if the extraction heuristic learned from all problems other than problem i had acceptable performance on problem i.
The figure illustrates a number of interesting points. One surprising point is that many of the extraction tasks handled correctly by the page-independent learned heuristics are difficult to learn using examples taken from the specific page being wrapped. For instance, problems 32, 65 and 72 are handled perfectly with the page-independent extractor; however, extraction performance with examples from these pages alone is not ε-good, even with 20 positive examples. Thus the technique of learning page-independent extraction heuristics seems to be somewhat complementary to the technique of learning page-specific wrappers.
Another point is that, while labeling more examples does clearly help on average, performance improves erratically on many specific problems. In problems 29 and 30, for example, the learner appears to oscillate between performance that is perfect and performance that is not even ε-good. Because the amount of training data on these problems is small, adding even a single new positive example--together with all negative examples that precede it--can change the statistics of the training data substantially.
Based on these observations, we evaluated the performance of two possible methods for learning page-specific wrappers. In applying the pure intra-page method to problem i, the user is repeatedly prompted for the next positive example, until the wrapper learned from the existing dataset meets some performance goal. At this point learning stops. We simulated the intra-page method experimentally for two performance goals (perfection, and ε-goodness) using the depth-first left-to-right tree traversal discussed above to simulate user input. The results are shown in the two curves labeled ``intra-page'' in Figure ; for each value of K, we show the percentage of benchmark problems that meet the performance goal with K or fewer examples.
In the hybrid method, the system first learns a page-independent classifier from all pages other than page i, and then attempts to wrap that page with the learned extraction heuristics. If the user accepts this wrapper--in simulation, if the performance goal is reached--then learning stops. Otherwise, the user is repeatedly prompted for the next positive example, as in intra-page learning, until the learned wrapper meets the performance goal. The results are shown in the two curves labeled ``hybrid'' in Figure ; however, for each value K, we show the percentage of benchmark problems that meet the performance goal with K or fewer user inputs, rather than K or fewer examples. The first ``user input'' is the acceptance or rejection of the page-independent wrapper--this query is thus charged equally with labeling an example.
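The only difference in accounting for the hybrid method is that the accept/reject query on the page-independent wrapper itself costs one user input. A sketch of that bookkeeping (the function name and signature are ours):

```python
def hybrid_inputs(page_independent_ok, intra_page_examples):
    """User-input count for the hybrid protocol.

    page_independent_ok: whether the page-independent wrapper met the
    performance goal on this page (i.e., the user accepts it).
    intra_page_examples: the number of examples intra-page learning needs
    on this page, or None if the goal is never reached.
    The accept/reject query is charged equally with labeling an example."""
    if page_independent_ok:
        return 1  # a single input: accepting the page-independent wrapper
    if intra_page_examples is None:
        return None  # the goal is unreachable either way
    return 1 + intra_page_examples  # rejection query plus labeled examples
```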
The graph shows that the hybrid method offers a substantial advantage over the pure intra-page method; in many situations, one would have to label two or three times as many examples to obtain a comparable set of wrappers. For instance, one can obtain perfect wrappers for 60% of the problems with only 6 user inputs for the hybrid system, while for the intra-page system, 15-20 labeled examples are necessary; also, the hybrid system gets acceptable performance on 80% of the benchmarks after seeing only 6 training examples, while the intra-page system requires 12 training examples to do as well.
Similar results were obtained with CART (not shown), except that CART's performance was worse for very small values of K.