KPS --- a Web Information Mining Algorithm

Tao Guan
Department of Computer Science
University of Regina
Regina, Sask, Canada S4S 0A2

Kam Fai Wong
Department of System Engineering & Engineering Management
The Chinese University of Hong Kong
Shatin, NT, Hong Kong



The Web mostly contains semi-structured information. It is, however, not easy to search for and extract the structured data hidden in a Web page. Current practices address this problem by (1) syntax analysis (i.e. of HTML tags), or (2) wrappers or user-defined declarative languages. The former is only suitable for highly structured Web sites, and the latter is time-consuming and offers low scalability: wrappers can handle tens, but certainly not thousands, of information sources. In this paper, we present KPS, a novel algorithm for mining semi-structured information on the Web. KPS employs keywords, patterns and/or samples to mine the desired information. Experimental results show that KPS is more efficient than existing Web extraction methods.



Keywords: Information Extraction, Information Retrieval, Web Query and Web Databases


1 Introduction

The World Wide Web (WWW) provides a vast source of information. Compared with traditional databases, Web information is dynamic, semi-structured and interwoven with hyperlinks. Also, it can be represented in different forms and is globally shared over multiple sites and platforms. Hence, querying the WWW is significantly different from querying traditional databases, e.g. relational databases, which are structured, centralized and static. Recent research [3, 6, 13, 18, 20, 21] attempts to exploit the topology and inner structures of Web pages in WWW queries. A number of Web query languages (or data manipulation languages) have been developed [14]. Queries in these languages are based on the semi-structured data model, which is mostly represented as labeled graphs. Thus, the hidden information in textual Web pages needs to be transformed into graphical representations by site-specific wrappers (or user-defined declarative languages) [17, 24] or by syntax analysis (e.g. of HTML tags or font sizes) [5]. The wrapper approach is time-consuming and does not scale up: it can cope with tens of information sources, but it is ineffective for thousands. On the other hand, the syntax analysis approach is only suitable for highly structured Web sites and fails to handle many common cases (see later).

Most Web documents are text-oriented. Much relevant information is embedded in the text and cannot be explicitly or easily specified in a user query. For example, the biography of a professor is usually written as a text string. Thus, the relation (Time, Degree, School), which represents when, what and where a professor received his/her degree, cannot be specified easily. Representing data in XML format could partially reduce the difficulty, but there are still many text-based Web pages (e.g. on-line news or newspapers) which are hardly structured or represented in a unified XML DTD (Document Type Definition). The information sources may contain relevant data that is not tagged with specific items. Sometimes the data is simply not tagged; at other times, the tags employed differ even when the items being described are the same or similar. Thus, ad hoc queries for Web information extraction are still required. In this paper, we present a novel approach known as KPS for extracting information from Web pages (in this paper, the terms "information mining" and "information extraction" are synonymous and are used interchangeably). Our goal is to extract relevant information either automatically or with minimal human intervention. Rather than representing a Web page as a labeled graph or a relation as in current practices [6, 18, 20, 21], the KPS algorithm mines the desired information from irregular pages directly, using keywords, patterns and/or samples. To do this, we assume that:


  1. important information is always highlighted by keywords or meaningful structures, since good visual effects are common practice in presenting Web data;
  2. common patterns exist in English, e.g. the word after "Dr." or "Mr." should be a name; and
  3. similar structures or patterns appear in the Web pages of the same organization, since these pages are often written by the same webmasters and thus similar styles (or even simple copies) are employed.

Therefore, a set of heuristic rules or patterns can be used to identify such information on the WWW. Our experiments show that the KPS algorithm for extracting information from the semi-structured WWW is more efficient than conventional approaches, since no complicated learning or feedback is required, while the results remain competitive.

Our mining algorithms can be used in Web query languages to replace wrappers (they are independent of specific sites), or in metasearching [15], to further filter the information obtained from general search engines.


1.1 Related Work

Structural information extraction from the Web is studied in [2, 5, 7, 9, 17, 24, 25]. [5] relies on syntax analysis and applies font size and indentation to structure a page; it is, however, inadequate for some pages. [17, 24] allow users to define a pattern or to use a declarative language for extracting structures, but are only designed for site-specific applications. In contrast, [2, 9, 10, 18, 19, 25] practice a different approach and employ machine learning to extract information. Extensive corpus training is required in this approach to achieve good effectiveness. This is similar to our sample-based mining method, but our algorithm is much simpler and more efficient.

On the other hand, information extraction (IE) [11, 12], an independent research field, has been studied for a long time. Its application to Web data is discussed in [25]. Since it requires domain-specific semantic knowledge and corpus training, it is only suitable for domain-specific applications. Information extraction algorithms for Web data differ from classic ones in that they commonly make use of the semi-structured features of Web pages (i.e. HTML tags).

Text mining (or knowledge discovery in text) has recently been addressed by [4]. It aims to extract concepts and their relationships from a set of textual documents. Part of the work, e.g., recognition of keywords and their assignment to simple categories and roles (e.g. Person, Place, Company), is close to our study.

Other relevant work includes Web query languages (e.g. W3QS[19], WebSQL[22] and WebLog[21]), Web data manipulation languages (e.g. WebOQL[3] and StruQL[13]), semistructured data[1] and unstructured data[8].

The rest of the paper is organized as follows: Sections 2 to 4 introduce the three information extraction approaches in KPS, respectively. In Section 5, the experimental results are outlined and analysed. Finally, Section 6 concludes the paper.


2 Keyword-based Mining

Keyword-based mining is used to extract a value associated with a keyword, e.g. E-mail, Publication, or Research interests. When a keyword is specified, we first look for the keyword in the pages. Once it is located, the following heuristic rules will be applied to mine the target information automatically.


If the keyword denotes a number (in a conditional expression), the target information is the first number occurring in the string obtained from the above rules.

In addition, synonyms and hyponyms are also considered in locating keywords. Here, word A is a hyponym of word B if A and B refer to the same semantic class and B describes something more general than A. For example, father is a hyponym of parent, and parent is a hyponym of relative (see Figure 2). If one is looking for the keyword parent, the word father could be a match. Our synonymy and hyponymy dictionary is based on WordNet[23], the product of a research project at Princeton University; it is a dictionary which models the lexical knowledge of a native English speaker.
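As an illustration, keyword lookup with synonym/hyponym expansion can be sketched as follows. This is a minimal sketch under simplifying assumptions: the hand-built HYPONYMS table stands in for the WordNet-based dictionary, and the "value" of a keyword is approximated as the rest of the line on which it (or a hyponym) is found.

```python
import re

# Hypothetical stand-in for a WordNet-backed dictionary: each keyword
# maps to synonyms/hyponyms that may also match it on a page.
HYPONYMS = {
    "parent": ["father", "mother"],
    "email": ["e-mail", "mail"],
}

def find_keyword(text, keyword):
    """Return the fragment following the first occurrence of the keyword
    (or one of its synonyms/hyponyms), up to the end of that line."""
    terms = [keyword] + HYPONYMS.get(keyword.lower(), [])
    for term in terms:
        # optional colon after the keyword, then capture the rest of the line
        m = re.search(re.escape(term) + r"\s*:?\s*(.*)", text, re.IGNORECASE)
        if m:
            return m.group(1).strip()
    return None

page = "Office: Room 145\nE-mail: brown@cs.usask.ca\nPhone: (306) 781-7488"
print(find_keyword(page, "email"))   # -> brown@cs.usask.ca (via hyponym "e-mail")
```

Note that the hyponym "e-mail" is what makes the lookup succeed here, since the literal keyword "email" does not appear on the page.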


3 Pattern-based Mining

Pattern-based mining performs string matching on the WWW based on patterns specified by the users. A pattern consists of a number of constant words and variables (starting with /) and is delimited by a pair of brackets. When a pattern is specified, the system first locates the strings in pages which match the constant words in the pattern. If the matching is successful, the corresponding values on the page are assigned to the variables. For example, the pattern [Dr. /Name] matches a string starting with the sub-string "Dr." followed by a noun phrase. The noun phrase after "Dr." is then assigned to the Name variable; e.g. in the string "Dr. Jack Boan is ...", "Jack Boan" is assigned to Name.

The reason why we focus on noun phrases is that we assume most interesting information is represented as noun phrases or numbers (verbs usually indicate an action or a state). The definition of noun phrases is as follows:


      NP  -> NP2 | Det NP2 | NP Conj NP  
      NP2 -> Noun | Noun NP2 | Adj NP2 | NP2 PP  
      PP  -> Prep NP  

where NP denotes noun phrase, Det denotes determiner, Conj denotes conjunction, Adj denotes adjective and Prep denotes preposition.

Two or more patterns may be linked using boolean operators. For example, the expression "[Mr. /Name] or [Ms. /Name]" means that a word matching either pattern can be assigned to the variable Name. Of course, it is possible that more than one word can be assigned to a variable. In this case, we count the number of matches of each word and choose the one with the highest frequency, or the first one if two or more words have the same maximum frequency count.

In addition, wildcards can be used in a pattern, i.e. "*" and "-", where "-" denotes a single word and "*" represents any number of words. For example, the pattern [Dr. /Name received /Degree from * in /Year] extracts information into the variables Name, Degree and Year, while the university is ignored.
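A minimal sketch of this pattern language is shown below. It makes one simplifying assumption: variables are approximated as runs of capitalized words or numbers, rather than full noun phrases recognized by the grammar above.

```python
import re

def compile_pattern(pattern):
    """Compile a KPS-style pattern such as
    '[Dr. /Name received /Degree from * in /Year]' into a regex.
    Variables (/X) are approximated as runs of capitalized words or numbers,
    a simplification of full noun-phrase recognition."""
    body = pattern.strip().strip("[]")
    parts = []
    for tok in body.split():
        if tok.startswith("/"):
            # variable: capture a crude "noun phrase" (capitalized words/numbers)
            parts.append(r"(?P<%s>[A-Z0-9][\w.]*(?:\s+[A-Z0-9][\w.]*)*)" % tok[1:])
        elif tok == "*":
            parts.append(r".*?")   # any number of words, matched lazily
        elif tok == "-":
            parts.append(r"\S+")   # exactly one word
        else:
            parts.append(re.escape(tok))
    return re.compile(r"\s+".join(parts))

rx = compile_pattern("[Dr. /Name received /Degree from * in /Year]")
m = rx.search("Dr. Jack Boan received Ph.D. from U. of Toronto in 1998 ...")
print(m.group("Name"), m.group("Degree"), m.group("Year"))  # Jack Boan Ph.D. 1998
```

Here "U. of Toronto" is skipped by the "*" wildcard, exactly as in the example pattern above.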


4 Sample-based Mining

The sample-based mining method extracts information based on a sample specified by the users. It rests on the assumption that a small group of Web pages is likely to be written with similar structures and styles. This is typically the case for the intranet of an institute, in which all Web pages are designed by one webmaster. Therefore, when a user is looking for something (e.g. an email address), he/she may first locate one instance (i.e. the sample, in this paper) on a page manually and inform the system that he/she would like to obtain similar ones. The system will then search the other pages automatically.


4.1 The Formal Model

Let W={w1, w2, ...., wn} be a group of Web pages which have a similar style. Each page (source code, i.e. an HTML file) wi can be broken into a list of fields:


	  wi=(fi1, fi2, ...., fik);

where each field may be:


Note that the above-mentioned definitions of a field are not exhaustive. More definitions will be introduced in the future.

Definition 1 An object o is a list of contiguous fields appearing in the body of a Web page, whose first and last elements cannot be HTML tags. For example, consider the first HTML example in the Appendix. The professor's name Bill Brown may be represented as the object


	     o = (Bill, Brown);

Please note that the ordering is very important in an object, that is, (Bill, Brown) is different from (Brown, Bill).

Furthermore, a sample is a specific object indicated by the user. Usually, the user chooses the sample through a browser interface and thus does not see the HTML tags hidden inside it.

Next we discuss how to mine the desired information using the sample specified by the user. We consider two kinds of similarity between the sample and potential objects, known as pattern similarity and style similarity. We discuss them in the following sub-sections.


4.2 Pattern Similarity

Pattern similarity measures how much two objects match with each other. Before we introduce it, we need to define the concept of matched fields.

Definition 2 A field fi matches with another field fj, denoted by fi = fj, if


For example,


     <B> = <B>,                    'good' = 'good',    123 = 17.86, 

     '10 Jun 1988' = '01/06/67',    HK$100 = $20,      '.' = '.'

However, <B> does not match with </B>, and nor does 'good' with 'hot'. The intuition behind this is that words are general elements in Web pages, so they cannot be used to identify an object (only identical words may suggest similarity to some extent). In contrast, HTML tags, numbers, dates, times, prices and symbols are special data and thus characterize an object. However, in our experience, HTML tags and symbols are usually fixed among a set of similar objects, whereas numbers, dates, times and prices are variable. Therefore, we require the former to have identical values, but not the latter.

Definition 3 An object p=(p1, p2, ...., pm) is a sub-object of q=(q1, q2, ...., qn) if there is a sublist (qi1, qi2, ...., qim) of q such that ik < ik+1 for each k, 1 ≤ k ≤ m-1, and pk = qik for each k, 1 ≤ k ≤ m.

Definition 4 An object p=(p1, p2, ...., pm) matches with the object q=(q1, q2, ...., qn), denoted by p = q, if m = n and pk = qk for each k, 1 ≤ k ≤ m.

Now we can define the pattern similarity measure between two objects.

Definition 5 For objects p and q, the pattern similarity measure, denoted by PSM, is the maximum size of sub-objects p' of p and q' of q such that p' = q'.

For example, for objects p=(Phone, :, (, 306, ), 781, -, 7488) and q=(Tel, :, 1, -, 306, -, 781, -, 7453), the maximum matched sub-objects are (:, 306, 781, -, 7488) = (:, 1, 306, -, 781). Thus, PSM(p, q)=5.

In order to reduce the effect of false matches between large objects, such as "The student is called Tom" and "A student is reading a book written by Tom", we first consider the ratio of PSM to the average size of the objects to reflect the real pattern similarity. However, for two objects whose average size is 10 and whose PSM is 5, the ratio is 50%, which is the same as that of objects whose average size is 2 and PSM is 1. In practice, the former should have the higher similarity, since the one matched field in the latter may be a false match. Therefore, we multiply the ratio by the PSM itself. The final result is called the pattern similarity score (denoted by PSS):


	  PSS(p, q) = (PSM(p, q)/((size(p)+size(q))/2))*PSM(p, q)*100;

In the above example, PSS( p, q)=(5/((8+9)/2))*5*100= 290;
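The PSM computation can be viewed as a longest-common-subsequence problem over fields, with the matching rules of Definition 2 in place of plain equality. The sketch below makes that concrete under simplifying assumptions: the field classifier is a crude stand-in for the full typing of tags, numbers, dates, times and prices, and dates/prices are lumped into a single "number" kind.

```python
import re

def field_kind(f):
    """Crude field classifier: HTML tag, numeric field, or plain word/symbol."""
    if re.fullmatch(r"<[^>]+>", f):
        return "tag"
    if re.fullmatch(r"[\d.,/$-]*\d[\d.,/$-]*", f):
        return "number"          # numbers, dates and prices lumped together
    return "word"

def fields_match(a, b):
    ka, kb = field_kind(a), field_kind(b)
    if ka != kb:
        return False
    if ka == "number":           # any two numeric fields match (Definition 2)
        return True
    return a == b                # tags, words and symbols must be identical

def psm(p, q):
    """Pattern Similarity Measure: longest common subsequence of matching
    fields, computed by standard dynamic programming."""
    m, n = len(p), len(q)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i+1][j+1] = (dp[i][j] + 1 if fields_match(p[i], q[j])
                            else max(dp[i][j+1], dp[i+1][j]))
    return dp[m][n]

def pss(p, q):
    s = psm(p, q)
    return s / ((len(p) + len(q)) / 2) * s * 100

p = ["Phone", ":", "(", "306", ")", "781", "-", "7488"]
q = ["Tel", ":", "1", "-", "306", "-", "781", "-", "7453"]
print(psm(p, q))        # -> 5, as in the worked example above
print(round(pss(p, q), 1))
```

The dynamic program runs in time proportional to the product of the object sizes, which is negligible for the short objects typical of Web pages.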

If we assume the object p is the sample, then the PSS is enough to distinguish the object q from other parts. However, if there is another object l=(Fax, :, 1, -, 306, -, 781, -, 7453) (e.g. in the second example in the Appendix), it is hard to determine which (q or l) should match p, since PSS(p, q)= PSS(p, l). In this case, the context should be taken into consideration. Before we discuss it, we first introduce a new concept, the segment.

Definition 6 A segment is a list of contiguous fields in a Web page which does not contain HTML tags.

If we consider all HTML tags as segment delimiters in Web pages, then the body (between <body> and </body>) can be viewed as a list of segments. (Since the head, i.e. the text between <Head> and </Head>, of a page usually cannot be viewed by users in a browser, we do not consider it in the extraction process.) For instance, example 1 in the Appendix can be represented as

              Segment1=(Faculty, Profile);

              Segment2=(Bill, Brown, - , Associate, Professor);

              Segment3=(brown, @, cs,  ., usask,  ., ca);
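A minimal segmenter under these assumptions is sketched below. Fields are simplified to tags and whitespace-separated words, so punctuation inside a word (as in Segment3 above, where '@' and '.' are separate fields) is not split out; every tag acts as a segment delimiter.

```python
import re

def segments(html_body):
    """Split an HTML body into segments: tags delimit segments, and each
    segment is the list of word fields between consecutive tags."""
    fields = re.findall(r"<[^>]+>|[^<\s]+", html_body)
    segs, cur = [], []
    for f in fields:
        if f.startswith("<"):     # any tag closes the current segment
            if cur:
                segs.append(cur)
            cur = []
        else:
            cur.append(f)
    if cur:
        segs.append(cur)
    return segs

body = "<h1>Faculty Profile</h1><p>Bill Brown - Associate Professor</p>"
print(segments(body))
# -> [['Faculty', 'Profile'], ['Bill', 'Brown', '-', 'Associate', 'Professor']]
```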



Now we define Pre-marker and After-marker which are used to identify the beginning and end of an object.

Definition 7 For object o, the Pre-marker is defined as follows:

     If the first element of o is the first field of a segment

     and the segment is the first segment in the page, then

           Pre-marker is  <beg-file>;

     else if the first element is the first field of a segment, then 

           Pre-marker is  <beg-seg>;


     else
           Pre-marker is the field immediately before the first element.

Definition 8 For object o, the After-marker is defined as follows:

     If the last element of o is the last field of a segment

     and the segment is the last segment in the page, then

           After-marker is  <end-file>;

     else if the last element is the last field of a segment, then 

           After-marker is  <end-seg>;


     else
           After-marker is the field immediately after the last element.

  For example, Pre-marker(o) = <beg-seg> and After-marker(o) = '-' in example 1, where o is (Bill, Brown).
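Definitions 7 and 8 can be transcribed directly, assuming the page body has already been reduced to a list of segments and tags between segments are abstracted away, so an object starting at the first field of a segment gets <beg-seg>, as in the example above.

```python
def pre_marker(page, si, fi):
    """Pre-marker of an object whose first element is field fi of segment si;
    page is the body represented as a list of segments (lists of fields)."""
    if fi == 0:
        return "<beg-file>" if si == 0 else "<beg-seg>"
    return page[si][fi - 1]

def after_marker(page, si, fi):
    """After-marker of an object whose last element is field fi of segment si."""
    if fi == len(page[si]) - 1:
        return "<end-file>" if si == len(page) - 1 else "<end-seg>"
    return page[si][fi + 1]

page = [["Faculty", "Profile"],
        ["Bill", "Brown", "-", "Associate", "Professor"],
        ["brown", "@", "cs", ".", "usask", ".", "ca"]]
# object (Bill, Brown): first element is field 0 of segment 1, last is field 1
print(pre_marker(page, 1, 0), after_marker(page, 1, 1))   # <beg-seg> -
```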

  Furthermore, Pre-part and After-part are used to identify the context of an object.

Definition 9 The Pre-part of an object o is defined as:

      If the first element of o is the first field of a segment

      and the segment is the first segment in the page, then

            Pre-part is  <beg-file>;

      else if the first element of o is the first field of a segment, then

           Pre-part is the segment immediately before the segment containing
	   the first element of o;


      else
            Pre-part is the list of fields from the beginning of the segment
            containing the first element of o, up to and including the <Pre-marker>.

Definition 10 The After-part of an object o is defined as:

      If the last element of o is the last field of a segment

      and the segment is the last segment in the page,  then

            After-part is  <end-file>;

      else if the last element of o is the last field of a segment,  then

            After-part is the segment immediately after the segment containing
	    the last element of o;


      else
            After-part is the list of fields from the <After-marker> until
            the end of the segment containing it.

  For example, for the object o in Example 1, we have,

<Pre-part>(o) = (Faculty, Profile) and <After-part>(o) = (-, Associate, Professor);

The <Pre-part> and <After-part> can be used to calculate the pattern similarity between the contexts of the objects. For special fields, e.g. <beg-file> and <end-file>, the PSS is obtained from the following table:

Thus, the total pattern similarity score (TPSS) between two objects is


       TPSS(p, q) = PSS(p, q) + 0.5*PSS(Pre-part(p), Pre-part(q)) +
				0.5* PSS(After-part(p), After-part(q)); 

Let us look at the above example again. We already have


        PSS(p, q)= PSS(p, l) = 290;



   PSS(Pre-part(p), Pre-part(q)) = PSS((Office, :, 145, Engineering, Building), 
  			               (Office, :, 153, Engineering, Building))= 500;

   PSS(After-part(p), After-part(q)) = PSS((Fax, :, (, 306, ), 781, -, 6754),
			                    (Fax, :, 1, -, 306, 781, -, 6754)) = 290;

   PSS(Pre-part(p), Pre-part(l)) = PSS((Office, :, 145, Engineering, Building), 
			                (Tel, :, 1, -, 306, -, 781, -, 7453)) = 57;

   PSS(After-part(p), After-part(l)) = PSS((Fax, :, (, 306, ), 781, -, 6754),
			                   (Research, Interests)) = 0;

   TPSS(p, q) = 290 + 0.5*500 + 0.5*290 = 685;

   TPSS(p, l) = 290 + 0.5*57 + 0.5*0 = 318.5;
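The combination step is a direct transcription of the TPSS formula; plugging in the PSS values above reproduces the two totals:

```python
def tpss(pss_obj, pss_pre, pss_after):
    """Total pattern similarity score: the object's own PSS plus the two
    context scores, each weighted by 0.5 as in the formula above."""
    return pss_obj + 0.5 * pss_pre + 0.5 * pss_after

# PSS values taken from the worked example in the text
print(tpss(290, 500, 290))   # -> 685.0
print(tpss(290, 57, 0))      # -> 318.5
```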

Thus, q should match with p rather than l.


4.3 Style Similarity

Although HTML tags are often used in Web pages for cosmetic reasons, they provide important clues for locating the desired information. Usually, pages from an institute have a similar style, since many public pages are designed by the same professionals. This property can be exploited in Web searching. Since the style is determined by the HTML tags in Web pages, we first introduce the concept of the HTML-path of an object.

Definition 11 For an object o, the HTML-path is a list of HTML tags (t1, t2, ...., tm), where


In fact, the HTML-path is the maximum-length list of HTML tags which may take effect on the object. For example, consider the two objects o=(Bill, Brown) in example 1 and m=(Alan, Bell) in example 2 (see Appendix):


          HTML-path( o) = (<table>, <tr>, <td>, <p>);

          HTML-path( m) = (<table>, <tr>, <td>, <p>);

Once the HTML-path of an object is determined, we may compare it with the HTML-path of the sample: the more they match, the more similar the style they share. Therefore, if we treat the HTML-path as a special object (a list of HTML tags), the Style Similarity Score (SSS) can be used to quantify style similarity:


     SSS(p, q) = (PSM(p, q)/((size(p)+size(q))/2))*PSM(p, q)*100;

For example,


     SSS(HTML-path(o), HTML-path(m)) = (4/((4+4)/2))*4*100 = 400;
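An HTML-path can be recovered by tracking the stack of currently open tags while parsing, e.g. with Python's html.parser. This is a sketch: it records the stack at the first occurrence of the target text and ignores tags that are closed implicitly.

```python
from html.parser import HTMLParser

class PathFinder(HTMLParser):
    """Track the stack of open tags; when the target text is seen,
    record the current stack as the object's HTML-path."""
    def __init__(self, target):
        super().__init__()
        self.target, self.stack, self.path = target, [], None

    def handle_starttag(self, tag, attrs):
        self.stack.append("<%s>" % tag)

    def handle_endtag(self, tag):
        # pop only a properly nested close; implicit closes are ignored
        if self.stack and self.stack[-1] == "<%s>" % tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.target in data and self.path is None:
            self.path = list(self.stack)

doc = "<table><tr><td><p>Bill Brown - Associate Professor</p></td></tr></table>"
pf = PathFinder("Bill Brown")
pf.feed(doc)
print(pf.path)   # -> ['<table>', '<tr>', '<td>', '<p>']
```

The recovered path matches HTML-path(o) in the example above; two such paths can then be fed to the SSS formula as ordinary objects.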


4.4 The Algorithm

We have discussed pattern similarity and style similarity in the above sections. They can be used to extract the desired information based on the sample specified by the users. The following is our algorithm:


Note that all segments which are not between any <Pre-marker> and <After-marker> are also marked as potential objects, since the <Pre-marker> and <After-marker> sometimes do not reflect the real situation and may cause mistakes. We may, therefore, need to consider other potential values.

Remark The complexity of the above algorithm is low. For each potential page, we need just two scans. In the first, all potential objects (including HTML-paths) are collected. In the second, we calculate the TPSS and SSS between them and the samples. Since objects are represented by lists, the matches can be done in linear time. Therefore, the total complexity of processing a page is O(n), where n is the size of the page. Of course, if we apply this algorithm to a large site, the number of potential pages may be large; we then need preprocessing to deal with it. See [16] for details.


5 Experimental Results

This paper discusses a novel approach to retrieving information from the Web. Although it cannot guarantee 100 percent success in mining the desired information, our initial experience shows that it is practical. The following describes the experiments for evaluating the practicality of the approach:

We tried to extract the names and email addresses of all professors in academic institutes, e.g. a university or a department. For simplicity, we only considered 10 Web sites selected from the WWW. The site at the University of Regina contains more than 300 faculty members whose homepages are written in various styles, and the faculty of engineering at the University of Alberta has 95 professors from four departments. The other sites are departmental sites, and it is easy to obtain the exact number of professors from them. Some Web pages are developed by the same webmaster, while others are written by the professors themselves. We first tested keyword-based mining for name and email, and then repeated the same tests with pattern-based mining and finally sample-based mining.


Furthermore, we attempted to compare our algorithms with other proposed algorithms[2, 5, 7, 9, 17, 18, 19, 24, 25]. However, comparisons are difficult since there is no common evaluation platform for Web information extraction, like MUC for information extraction from textual files. Most systems are tested on a number of selected Web sites, but their contents or formats may change over time. For example, we also ran sample-based mining on the site of the Australian Bureau of Meteorology to extract weather information (i.e. location, day, condition, high, low). Our recall and precision are 80% for location. The result is much better than Soderland's in [25], where the recall is 23.8% and the precision is 81.4% for the same information. Of course, Soderland's test was done a year earlier, so the test data may have changed. In any case, our approach needs only samples specified by users, which is much simpler and more efficient than Soderland's. In the experiments, we found that much information was indeed highlighted by keywords or common patterns, and that similar styles existed within a Web site. Users may choose different mining algorithms according to the characteristics of the sites. For example, publications or research interests are best mined by keyword-based mining, since most of them are highlighted by the same keywords in Web pages (of course, there are some cases in which "selected papers" is used rather than "publications"). Email, telephone and mail address can be handled by pattern-based mining because they almost always have the same pattern within a city. In contrast, name and biography may be targets of sample-based mining. Of course, there were a few exceptions:


The worst cases are Web pages designed by individuals (e.g. homepages at a CS department), which are in various styles. However, a special group of pages sometimes shares common structures. For example, most professors' homepages have the following structure: name, title, address, biography, teaching, research interests and publications. Therefore, extraction is not difficult even if they come from different places.


6 Conclusion and Further Work

We have discussed how to mine and extract information from the WWW. The novelty of our method is the use of keywords, patterns and samples to extract information from semistructured textual Web pages. Although it cannot mine all desired information, our experiments show that it is more efficient than other existing methods. In the future, we will apply more semantic knowledge and NLP techniques to mine complicated concepts, e.g. biographies.



  1. S. Abiteboul, D. Quass, J. Mchugh, J. Widom and J. Wiener, The Lorel query language for semistructured data, Journal of Digital Libraries, 1(1), 1997, 68-88.
  2. B. Adelberg , NoDOSE - A tool for semi-automatically extracting structured and semistructured data from text documents, in Proceedings of SIGMOD'98, 1998, 283-294.
  3. G. Arocena and A. Mendelzon, WebOQL: Restructuring documents, databases and webs, in ICDE'98, Orlando, Florida, 1998, 24-33.
  4. C. Arnie, Steps and Issues for Knowledge Discovery in Text, manuscript.
  5. N. Ashish and C. Knoblock , Wrapper generation for semistructured Internet sources, SIGMOD Record, 26(4), 1997, 8-15.
  6. P. Atzeni, G. Mecca and P. Merialdo, Semistructured and structured data in the Web: going back and forth, SIGMOD Record, 26(4), 1997, 16-23.
  7. S. Brin, Extracting patterns and relations from the world wide web, In Proceedings of International Workshop on the Web and Databases(WebDB), 1998
  8. P. Buneman, S. Davidson, G. Hillebrand and D. Suciu, A query language and optimization techniques for unstructured data, in Proceedings of SIGMOD'96, 1996, 505-516.
  9. M. Craven et al , Learning to extract symbolic knowledge from the world wide web, in Proceedings of AAAI-98, 1998, 509-516.
  10. W. Cohen and Y. Singer, Learning to query the web, In Proceedings of AAAI Workshop on Internet-Based Information Systems, 1996
  11. M. Costantino, R.G. Morgan, R. J. Collingham and R. Garigliano, Natural language processing and information extraction: Qualitative analysis of financial news articles, In Proceedings of the Conf. on Computational Intelligence for Financial Engineering, 1997
  12. J. Cowie and W. Lehnert, Information extraction, CACM, 39(1), 1996, 80-101.
  13. M. Fernandez, D. Florescu, A. Levy and D. Suciu, A query language for a web-site management system, SIGMOD Record, 26(3), 1997, 4-11.
  14. D. Florescu, A. Levy and A. Mendelzon, Database techniques for the world-wide web: a survey, SIGMOD Record, 27(3), 1998, 59-74.
  15. L. Gravano and Y. Papakonstantinou, Mediating and Metasearching on the Internet, Bulletin of the IEEE Data Engineering, 1998, 28-36.
  16. T. Guan and Kam-Fai Wong, Nstar: An interactive tool for local web search, submitted for publication.
  17. J. Hammer, H.G. Molina, J. Cho, R. Aranha and A. Crespo, Extracting semistructured information from the Web, In Proceedings of 1st Workshop on Management of Semistructured Data, Arizona, 1997.
  18. C. Hsu, Generating finite-state transducers for semi-structured data extraction from the web, Information Systems 23(8), 1998, 521-538.
  19. D. Konopnicki and O. Shmueli, W3QS: A query system for the world wide web, in VLDB'95, Zurich, 1995, 54-65.
  20. N. Kushmerick, Daniel S. Weld and R. Doorenbos, Wrapper induction for information extraction, in Proc. of the 15th International Joint Conference on Artificial Intelligence, 1997, 729-737.
  21. L.V.S. Lakshmanan, F. Sadri and I.N. Subramanian, A declarative language for querying and restructuring the Web. In Proceedings of 6th. International Workshop on Research Issues in Data Engineering, RIDE'96, New Orleans, 1996, 12-21.
  22. A. Mendelzon, G. Mihaila and T. Milo, Querying the World Wide Web, in Proceedings of First Int. Conf. on Parallel and Distributed Information System, 1996, 80-91.
  23. G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross and K. Miller, Introduction to WordNet: an on-line lexical database, International Journal of Lexicography, 1993.
  24. D. Smith and M. Lopez, Information extraction for semi-structured documents, In Proceedings of 1st Workshop on Management of Semistructured Data, Arizona, 1997.
  25. S. Soderland , Learning to extract text-based information from the world wide web, In Proceedings of 3rd International Conf. on Knowledge Discovery and Data Mining (KDD-97), 1997, 251-254.