|Performance level|# pages reached|
|---|---|
|good (e = 1%)|33/84 (39%)|
|good (e = 3%)|35/84 (42%)|
|good (e = 5%)|41/84 (49%)|
|good (e = 10%)|45/84 (54%)|
|good (e = 15%)|47/84 (56%)|
|good (e = 20%)|48/84 (57%)|
|good (e = 25%)|48/84 (57%)|
We used the following method to evaluate the learned extraction heuristics. For each wrapper/Web page pair <wi,pi>, we trained the learner on a dataset constructed from all other wrapper/page pairs: that is, from the pairs <w1,p1>, ..., <wi-1,pi-1>, <wi+1,pi+1>, ..., <wm,pm>. We then tested the learned extraction heuristics on data constructed from the single held-out page pi, measuring the recall and precision of the learned classifier.
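The protocol above can be sketched as follows. This is a hypothetical illustration, not the original system's code: `train` and `extract` stand in for the learner and the learned extraction heuristic, and each wrapper `w_i` is modeled as a callable that yields the target items for its page.

```python
# Sketch of the "leave one page out" evaluation protocol (assumed
# interfaces: train(pairs) -> model, extract(model, page) -> set,
# wrapper(page) -> target set; none of these names are from the paper).

def score(predicted, target):
    """Recall and precision of a predicted set against a target set."""
    tp = len(predicted & target)
    recall = tp / len(target) if target else 1.0
    precision = tp / len(predicted) if predicted else 1.0
    return recall, precision

def leave_one_page_out(pairs, train, extract):
    """Train on all wrapper/page pairs except the i-th, test on the
    held-out page, and collect one (recall, precision) point per page."""
    results = []
    for i, (w_i, p_i) in enumerate(pairs):
        model = train(pairs[:i] + pairs[i + 1:])  # all but the i-th pair
        predicted = extract(model, p_i)
        results.append(score(predicted, w_i(p_i)))
    return results
```

Running this over all 84 pages yields the 168 recall/precision measurements discussed next.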
This results in 168 measurements, two for each page. Before attempting to summarize these measurements, we first present the raw data in detail. All the results of this ``leave one page out'' experiment (and two variants that will be described shortly) are shown in the scatterplot of Figure ; for each page pi we plot a point whose x-coordinate is recall and whose y-coordinate is precision. So that nearby points can be more easily distinguished, we added 5% noise to both coordinates.
The scatter plot shows three distinct clusters. One cluster is near the point (100%,100%), corresponding to perfect agreement with the target wrapper program. The second cluster is near (0%,100%), and usually corresponds to a test case for which no data at all was extracted. The third cluster is near (50%,100%) and represents an interesting type of error: for most pages in the cluster, the learned wrapper extracted the anchor nodes correctly, but incorrectly assumed that the text node was identical to the anchor node. We note that in many cases, the choice of how much context to include in the description si of a URL ui is somewhat arbitrary, and hand-examination of a sample of these results showed that the choices made by the learned system are usually not unreasonable; it is therefore probably appropriate to consider results in this cluster as qualified successes rather than failures.
For an information integration system like WHIRL--one which is somewhat tolerant of imperfect extraction--many of these results would be acceptably accurate. In particular, results near either the (100%,100%) or (50%,100%) clusters are probably good enough for WHIRL's purposes. In aggregating these results, we thus considered two levels of performance. A learned extraction heuristic has perfect performance on a page pi if recall and precision are both 1. An extraction heuristic has e-good performance on a page pi if recall and precision are both at least 1 - e, or if precision is at least 1 - e and recall is at least 1/2 - e. The table in Figure shows the number of perfect and e-good results in conducting the leave-one-out experiment above.
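The two performance levels defined above can be encoded directly (a small sketch; recall, precision, and e are fractions in [0,1]). Note that the first clause of the e-good definition is subsumed by the second, since recall >= 1 - e implies recall >= 1/2 - e:

```python
# Direct encoding of the "perfect" and "e-good" criteria from the text.

def is_perfect(recall, precision):
    return recall == 1.0 and precision == 1.0

def is_e_good(recall, precision, e):
    """e-good: precision >= 1 - e, and recall >= 1 - e (the (100%,100%)
    cluster) or recall >= 1/2 - e (the (50%,100%) cluster). The recall
    conditions collapse to the weaker one, recall >= 1/2 - e."""
    return precision >= 1 - e and recall >= 0.5 - e
```

For example, a page with recall 50% and precision 98% counts as e-good at e = 5%, matching the intuition that the (50%,100%) cluster represents qualified successes.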
We will use e = 5% as a baseline performance threshold for later experiments; however, as shown in Figure , the number of e -good pages does not change much as e is varied (because the clusters are so tight). We believe that this sort of aggregation is more appropriate for measuring overall performance than other common aggregation schemes, such as measuring average precision and recall. On problems like this, a system that finds perfect wrappers only half the time and fails abjectly on the remaining problems is much more useful than a system which is consistently mediocre.
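A small hypothetical illustration (invented numbers, not from the experiment) makes the aggregation argument concrete: a system that is perfect on half the pages and extracts nothing on the rest has the same average recall as a consistently mediocre one, yet very different e-good counts.

```python
# Two invented result sets, each a list of (recall, precision) pairs.
bimodal  = [(1.0, 1.0)] * 5 + [(0.0, 1.0)] * 5   # perfect half the time
mediocre = [(0.5, 0.5)] * 10                     # always middling

def avg_recall(results):
    return sum(r for r, _ in results) / len(results)

def count_e_good(results, e=0.05):
    """Pages meeting the e-good criterion defined earlier."""
    return sum(1 for r, p in results if p >= 1 - e and r >= 0.5 - e)
```

Both systems have average recall 0.5, but only the bimodal system produces any e-good pages, which is why counting e-good pages better reflects practical usefulness here.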