Most existing learning systems learn to classify: that is, they learn to associate a class label from some small, fixed, set with an unlabeled instance. To use a classification learner on an extraction problem, it is necessary to re-cast extraction as a labeling task.
It is straightforward to use labels to encode the output of a wrapper
for a simple lists or a simple hotlists. Since each data item
extracted by the wrapper corresponds directly to a node in the parse
tree, one can encode output of wrapper by appropriately labeling parse
tree nodes. To encode a simple list, label a node ni as
``positive'' if it is associated with some extracted string si, and
``negative'' otherwise. For instance, in the parse tree for the
simple list in Figure
, every td node would be
labelled as positive. Given a correctly labeled tree, data can be
extracted by simply finding each positive node ni, and extracting
all the text below it as si, the i-th entry in the extracted
list.
Encoding a simple hotlist with parse tree labels can be done in a
similar way. For a simple hotlist, label a node ni as ``positive''
if it is associated with some extracted string si or some extracted
URL ui, and ``negative'' otherwise. For instance, in the parse
tree for the simple hotlist in Figure
, every a
(anchor) node would be labelled as positive, as well as every li
node. To extract data from a correctly labeled tree, one examines
each outermost positively labeled node yi, and does the following.
If yi contains some positive node zi, then for each such zi,
extract the pair
<
si,ui
>
where si consists of all
text below yi, and ui is the href attribute of zi. If
yi does not contain any positive nodes, then treat yi as both
the ``text node'' and the ``anchor node'': that is, extract the pair
<
si,ui
>
where si consists of all text below
yi, and ui is the href attribute of yi (which must be an
anchor).
To summarize, for simple lists and hotlists, the task of extracting data from a Web page can be re-cast as the task of labeling each node in the HTML parse tree for the page. By the same token, a wrapper can be represented as a procedure for labeling parse tree nodes. Such a node-labeling procedure can be learned from a sample of correctly labeled parse tree nodes. A set of correctly labeled parse tree nodes, in turn, can be easily generated given an existing wrapper and a page that is correctly wrapped.
We thus propose the following procedure for learning general, page-independent extraction procedures. Begin with a set of wrappers w1,...,wN that correctly wrap the Web pages p1,...,pN For each wi,pi pair, find the parse tree for pi, and label nodes in that tree according to wi. This results in a set of labeled parse tree nodes < ni,1,li,1 >,..., < ni,mi,li,mi > , which are added to a data set S. Finally, use S to train some classification learner. The output of the learner is a node-labeling procedure h, which is a function mapping parse tree nodes to the set {positive, negative}. The learned function h can then be used to label the parse tree nodes of new Web pages, and thereby to extract data from these pages.
It remains to describe the learning method, and the way in which parse tree nodes are encoded for the learner.