next up previous
Next: Features and learning methods Up: Extraction as classification Previous: Extraction as tree rewriting

   
Extraction of lists as classification

Most existing learning systems learn to classify: that is, they learn to associate a class label from some small, fixed, set with an unlabeled instance. To use a classification learner on an extraction problem, it is necessary to re-cast extraction as a labeling task.

It is straightforward to use labels to encode the output of a wrapper for a simple lists or a simple hotlists. Since each data item extracted by the wrapper corresponds directly to a node in the parse tree, one can encode output of wrapper by appropriately labeling parse tree nodes. To encode a simple list, label a node ni as ``positive'' if it is associated with some extracted string si, and ``negative'' otherwise. For instance, in the parse tree for the simple list in Figure [*], every td node would be labelled as positive. Given a correctly labeled tree, data can be extracted by simply finding each positive node ni, and extracting all the text below it as si, the i-th entry in the extracted list.

Encoding a simple hotlist with parse tree labels can be done in a similar way. For a simple hotlist, label a node ni as ``positive'' if it is associated with some extracted string si or some extracted URL ui, and ``negative'' otherwise. For instance, in the parse tree for the simple hotlist in Figure [*], every a (anchor) node would be labelled as positive, as well as every li node. To extract data from a correctly labeled tree, one examines each outermost positively labeled node yi, and does the following. If yi contains some positive node zi, then for each such zi, extract the pair < si,ui > where si consists of all text below yi, and ui is the href attribute of zi. If yi does not contain any positive nodes, then treat yi as both the ``text node'' and the ``anchor node'': that is, extract the pair < si,ui > where si consists of all text below yi, and ui is the href attribute of yi (which must be an anchor).

To summarize, for simple lists and hotlists, the task of extracting data from a Web page can be re-cast as the task of labeling each node in the HTML parse tree for the page. By the same token, a wrapper can be represented as a procedure for labeling parse tree nodes. Such a node-labeling procedure can be learned from a sample of correctly labeled parse tree nodes. A set of correctly labeled parse tree nodes, in turn, can be easily generated given an existing wrapper and a page that is correctly wrapped.

We thus propose the following procedure for learning general, page-independent extraction procedures. Begin with a set of wrappers w1,...,wN that correctly wrap the Web pages p1,...,pN For each wi,pi pair, find the parse tree for pi, and label nodes in that tree according to wi. This results in a set of labeled parse tree nodes < ni,1,li,1 >,..., < ni,mi,li,mi > , which are added to a data set S. Finally, use S to train some classification learner. The output of the learner is a node-labeling procedure h, which is a function mapping parse tree nodes to the set {positive, negative}. The learned function h can then be used to label the parse tree nodes of new Web pages, and thereby to extract data from these pages.

It remains to describe the learning method, and the way in which parse tree nodes are encoded for the learner.


next up previous
Next: Features and learning methods Up: Extraction as classification Previous: Extraction as tree rewriting
Wei Fan
1999-03-04