In most of our experiments we used the rule learning system RIPPER [2]. RIPPER has some advantages for this problem: in particular, it handles the ``set-valued'' features (described below) directly, and is efficient enough to use on problems of this scale. (About 65,000 parse-tree node examples are generated from the 84 wrapper/page pairs). The main reason for using RIPPER, however, was that we were familiar with it; as we show later, other learning systems achieve comparable results on this problem.
RIPPER, like most classification learners, requires an example to be represented as a vector of relatively simple features. The value of each feature is either a real number, or else a symbolic feature--an atomic symbol from a designated set, like {true, false}. The primitive tests allowed for a real-valued feature are of the form f<=q or f>=q, where f is a feature and q is a constant number, and the primitive tests allowed for a symbolic feature are of the form f=a_{i}, where f is a feature and a_{i} is a possible value for f. RIPPER also allows set-valued features [3]. The value of a set-valued feature is a set of atomic symbols, and tests on set-valued features are of the form a_{i} IN f, where f is the name of feature and a_{i} is a possible value (e.g., ul IN ancestorTagNames). For two-class problems of this sort, RIPPER uses a number of heuristics to build a disjunctive normal form formula that generalizes the positive examples. This formula is usually thought of as a set of rules, each rule having the form ``label an instance `positive' if t_{1} and t_{2} and ... '', where each t_{i} in the rule is a primitive test on some feature.
After some thought, we devised the following nineteen features to describe a parse tree node. The tag name is the HTML tag name associated with the node, such as a, p, br, and html. This is an informative feature: some tags such as head are always negative, while, other tags such as the anchor tag a are often positive. We measured the size of the string directly associated^{3} with a node in two ways: the text length is the total size of all the text associated with a node, and the non-white text length is similar to text length but ignores blanks, tabs, and line returns. We also measured the length of text contained in the subtree rooted at the current node by the features recursive text length and recursive non-white text length; these features are important because they measure the size of the string s_{i} that would be extracted if the node were marked as positive. The features set of ancestor tag names, depth, number of children, number of siblings, parent tag name, set of child tag names and set of descendent tag names are other natural and easily-computed features. Since the size of parse trees varies considerably, we also normalize many of the above features by the total number of nodes or by the maximal node degree.
The final features we designed are intended to detect and quantify repeated structure in the parse tree; intuitively, this is important, because positive nodes often reside on a structure frequently repeated in the tree. The repetitive aspect of a structure can often be detected by looking at the sequence of node tags that appear in paths through the tree; for instance, in the parse tree for bibliography page of Figure , there are many paths labeled li-a that originate at the ul node.
To measure this sort of repetition, let tag(n) denote the tag associate with a node n, and define the tag sequence position of n, p(n), as the sequence of tags encountered in traversing the path from the root of the parse tree to n: that is, p(n) = < html,...,tag(n) > . If p(n_{1})= < t_{1},...,t_{k} > . and p(n_{2})= < t_{1},...,t_{k},t_{k+1},...,t_{m} > , then we say the tag sequence position p(n_{1}) is a prefix of p(n_{2}); if additionally p(n_{1}) is strictly shorter than p(n_{2}), then we say that p(n_{1}) is a proper prefix of p(n_{2}).
We use the node prefix count for n as a way of measuring the degree to which n participates in a repeated substructure. The node prefix count for n, p_{count}(n), is the number of leaf nodes l in the tree such that the tag sequence position of nis a prefix of the tag sequence of l: more formally, p_{ count}(n) = |{l: p(n) is a tag sequence prefix of p(l), l is a leaf}|. The node suffix count for n, s(n), is closely related; it is defined as the number of leaf nodes l with tag sequence positions of which p(n) is a proper prefix. We normalize both p_{count}(n) and s_{count}(n) by the total number of paths in the tree.
These last features are clearly engineered, but are based on clear and plausible intutions; the tree-rewriting language specifies nodes to be re-written in terms of their tag sequence positions, and tools for writing programs in the tree-rewriting language are also based on looking for repeated tag sequences. Most of the other features were introduced with a fairly non-selective ``brainstorming'' process; here we were looking for features that were easy to compute and plausibly related to the classification task. No effort was made to to select an optimal set of features, or to include any sort of strong knowledge of the type of extraction programs we expected to see.
Rule 1. Label a node positive if tagName=a and normalizedNodeSuffixCount >= 0.445545.
Rule 2. Label a node positive if tagName=a , normalizedNodeSuffixCount >= 0.2666355, and depth <= 4. Rule 3. Label a node positive if normalizedNodePrefixCount >= 0.688645, tagName=p, and numSiblings >= 317. Rule 4. Label a node positive if tagName=td and normalizedNodePrefixCount >= 0.981132. |
As an illustration of the way in which these features are used in learning, Figure shows some representative rules that appear in the hypothesis obtained by training RIPPER on all of the 84 wrapper/page pairs. Rule 1 says that a node should be labeled as ``positive''--that is, part of a list to be extracted--if it is an anchor (a) element that repeats frequently, as measured by the normalized suffix count for the node. Rule 2 is similar; it marks anchor elements for extraction if they repeat somewhat less frequently, but are close to the root of the parse tree. Rule 3 marks paragraph (p) elements for extraction; since this type of element is less likely to be in a list, the rule requires very strong repetition, as measured both by the node prefix count, and by the number of siblings of the node. Rule 4 marks table entry elements (td) for extraction if they are part of a repeating structure that comprises most of the page. The complete hypothesis learned by RIPPER from this data contains a total of 36 such rules.