2Dept of Computer Science
3Dept of Electrical
This paper proposes two enhancements to existing search services over the Web. One enhancement is the addition of limited dynamic search around results provided by regular Web search services, in order to correct part of the discrepancy between the actual Web and its static image as stored in search repositories. The second enhancement is an experimental two-phase paradigm that allows the user to distinguish between a domain query and a focused query within the dynamically identified domain. We present Fetuccino, an extension of the Mapuccino system that implements these two enhancements. Fetuccino provides an enhanced user-interface for visualization of search results, including advanced graph layout, display of structural information and support for standards (such as XML). While Fetuccino has been implemented on top of existing search services, its features could easily be integrated into any search engine for better performance. A light version of Fetuccino is available on the Internet at http://www.ibm.com/java/fetuccino.
Keywords: search enhancements, search filtering, search results visualization, site mapping
The Web keeps growing at a phenomenal rate. From the perspective of search technology, this growth has two important characteristics. First, the "update" policy is totally uncontrolled, with millions of users creating, modifying and deleting content at will, and linking it to other content in an unstructured manner. Second, Web growth is increasingly fueled by the addition of dynamically and automatically generated content .
These characteristics impact the major criteria of evaluation for search services, namely effectiveness (i.e., the quality of results) and efficiency (i.e., response time). Effectiveness is typically evaluated in information retrieval (IR) by recall1 and precision2 , . Recall, in the context of the Web, requires a good coverage of the Web which is itself achieved via periodic and wide off-line "crawling" of the Web. Contents thus accessed is indexed and stored in inverted index repositories that are used at runtime to return results to queries. Because of the dynamic nature of the Web, there might be a substantial difference between the image of the Web that is stored in search repositories generated batch, and the actual Web. In addition to missing newly added documents, this batch - or simply put - static approach cannot address deletion or movement of documents that occurred since the last crawling (returning users the annoying broken link - AKA "404 error" - phenomenon. 3 In spite of the increasing computing power and network bandwidth, this discrepancy is likely to grow. At the same time, a fully dynamic search is clearly not scalable time wise. Thus, a combined approach is highly desirable.
The precision ratio is also a major challenge in Web-IR research, despite much work in this area in recent years. Improvement in Web search technologies include the taking into account of usage paths and users' recommendations such as in Alexa. Another family of advances has focused on analyzing the link topology4 of a sub-web, or more generally the hypertext structure, to discover "important" documents that should be assigned a higher score. Example approaches and systems along this direction include HyPursuit (), WebQuery (), , the "Hubs and Authorities" approach (, ) and Google ().
Nevertheless, most of these solutions still lack search context. This results in many returned entries that may all match the query but belong to different and unrelated domains. The lack of context stems from the inherent lack of structure in the Web. Some search engines, most notably Yahoo, address this problem by allowing to specify (sub) domains in which to evaluate the query. Such domain restriction significantly reduces the number of irrelevant results and narrows the search space, hence increasing precision. However, the domain hierarchy is manually (or at best semi manually) crafted as a taxonomy of predefined categories. The ability to support an "uncontrolled vocabulary" for domain specification is highly desirable from a usability perspective, but it requires addressing a set of non-trivial technical issues. First, the "domain views" (by analogy with the schema views in databases) would need to be dynamically identified. Second, users should be allowed to perform the search focused on a given topic within the scope of the dynamic domain.
The Fetuccino system, presented in this paper, addresses these issues by:
Fetuccino has been fully implemented as an extension of the Mapuccino site mapping system , previously known as WebCutter, . A light version of the service (without the two phase search) is already available on the IBM Corporate Java Page at http://www.ibm.com/java/fetuccino. A stand-alone "Personal Mapuccino/Fetuccino" system for Win 95/NT is also available for download on alphaWorks at http://www.alphaworks.ibm.com. The full-fledged applet (implemented as a Java bean and accepting XML-defined maps according to the SGF DTD ) and server have been integrated in IBM products including IBM WebSphere Site Analysis.
The rest of this paper is organized as follows: Section 2 describes
dynamic search and its embodiment in the Fetuccino light
service. Section 3 presents the two phase search paradigm, along with
some examples obtained with the full Fetuccino service. Section 4
summarizes the contributions of this paper and points to future work.
Dynamic search is primarily distinguished from conventional static search in that the former involves fetching the actual documents at the time the query is issued and analyzing their relevance to the search query on the fly, while the latter is based on evaluating the query against pre-computed repositories. Obviously, dynamic search cannot be employed "from scratch" because of the high cost of text analysis and the huge search space, and therefore it requires a starting point for recursive exploration as well as some limitation on the search space. For example, the WebGlimpse, system allows users to pose a query from any "WebGlimpse"-enabled page using the current page as a starting point for exploration, up to a predefined "hop" parameter (i.e. depth in the Web viewed as a graph). The rationale for searching for new relevant pages in the "vicinity" of other relevant pages follows from the hyper-text structure of the Web, where the hyper-link is likely to imply relevance of the target document to the source document.
Our basic idea is to invoke a conventional static search engine in order to obtain an initial set of "good" links to pages that are presumably relevant to the query, and continue the exploration from there. This dynamic exploration has several benefits. First, a more precise relevance analysis can be made since corpus statistics can be derived from a narrower domain than the full Web. Second, this process validates the existence and reachability of documents, i.e., it allows to detect "broken" links caused by documents or sites that do not exist anymore or documents/sites that are unreachable. Third, the dynamic exploration of outgoing links allows to discover pages that are relevant to the query but did not appear in the results of the static search
One way to look at this dynamic process is as an "automatic browsing" session, on behalf of the user, that significantly improves the starting point that the user would normally be at, had s/he used only static search. With static results only, such a user would typically conduct a manual browsing session, stumble on several "not found" or "not reachable" documents, and read irrelevant documents that only appear relevant. Then after finding relevant documents, s/he will probably further explore their links to find more relevant information. Using our method, much of this tedious and time consuming process can be avoided.
The approach of using page relevance as a heuristic to direct an exploration procedure has been first introduced by the "fish search" paradigm , which also capitalizes on the intuition that relevant documents often have relevant neighbors. Thus, it searches deeper under documents that have been found to be relevant to the search query, and stops searching in "dry" areas. In a previous paper , we introduced our own relevance based exploration that enhances the fish-search algorithm, for use in the context of site mapping. The algorithm, termed "shark search", improves the amount of relevant pages that are explored within a given amount of time. The original purpose of the shark-search algorithm was to dynamically build a Web site map, tailored to a given topic. We propose here to use the same kind of heuristic-based approach in order to dynamically explore the neighborhood of static search results. One key difference between this context of application and regular site mapping is that the root page is not manually specified by the user (via a URL), but dynamically generated as the result page from a given search service.
Figure 1: The Advanced Form in the Fetuccino service (light)
In addition to the benefits of dynamic search that were mentioned above, another key benefit of the exploration process is that near-by documents (in the Web topology) can be identified not as isolated pages but rather in the context of a structured Web neighborhood. Indeed, as the graph of the local sub-Web around each search candidate is fetched, a high-level view of that graph that shows the mutual relationships can be provided. This structural information is orthogonal to the content-based information upon which most traditional search engines base their scoring. Yet, knowing that many search results point to the same candidate (that is not necessarily a top candidate) gives an important clue on how "authoritative" that page is on a given subject. We discuss later how this concept of "authorities" has become more and more important in more recent search services. As a first approximation though, simply visualizing the structure of the results gives a (visual) clue of the importance of the document, as shown below.
The Fetuccino service (light version) embodies this dynamic crawling approach on top of existing search services. In the advanced options form, (see Figure 1 below), the user can select his/her favorite search service from a list of supported services, and control a set of parameters such as the time limit, depth (in terms of the hops to follow) etc.
Once the user clicks on the "Create the Results Map!" button, Fetuccino forwards the query to a selected search service. It then pushes those results back to the user's browser "as is" and at the same time feeds them to the dedicated crawler component, which starts the exploration within the time limit specified in the form. Note that during the exploration process, the user can already scan through the regular search results, which are provided in a separate browser window. Once the time limit is reached, a map of the results is sent back to the user, together with the Mapuccino visualization applet, and pops up on the client station.
The map view shows the search results page as root, and the top ranked candidates as first-level descendants. The shade of blue for each node label indicates the degree of relevance of each candidate, (the darker the more relevant) according to Fetuccino's built-in IR engine. Subsequent levels in the tree show the pages fetched by Fetuccino and how they are related to each other. The screendump below shows the added value of the interactive Fetuccino map, as discussed before:
Figure 2: Augmenting static candidates with
dynamically fetched results (regular tree layout)
A major problem when conducting searches over a large heterogeneous and uncontrolled document set such as the Web is with the quality of the results for ambiguous queries. Indeed, it happens quite often that the user discovers only after issuing a search that the query expression s/he picked has a different interpretation in a totally irrelevant domain. This might happen even if the user is an expert at expressing clear and precise queries, simply because of the large scope of the Web. This poor quality of the results is evidenced by low precision as well as low recall rates in the following cases:
One solution is to disambiguate the query by embedding the domain inside the query term. In our previous example, adding "architecture" to "Merced" would probably be enough to disambiguate the query. However, in our second example, adding the specific field of interest to "Michael Jordan" (e.g., "Cognitive Science") might narrow the search space too much (we give further experiments with this approach in Section 3.3) . Indeed, the disambiguating terms are not used by the search engine as a simple qualifier of the original query but as equally important terms. They might even overshadow the query. In our example, the term "Cognitive Science" might be rare enough to get more weight than "Michael Jordan". Another approach is to automatically cluster search results by contents, this approach first applied in regular IR systems , , is regaining attention in the context of the Web , . However, while clustering techniques are getting faster and more precise, clusters labels are still rarely as good as those provided by a manually defined ontology. Thus for quality control, some search services provide a precise ontology or taxonomy that allows he user to specify the domain in which the query should be evaluated, separately from the query term itself, as done in Yahoo. The user can browse through the available categories, and when the proper specificity is reached, the query is evaluated within the selected domain. The domain selection disambiguates the query and therefore improves the precision rate.
Fetuccino applies the same idea to dynamic search. The user specifies two terms: a domain term, and a query term. The search proceeds in two phases. In the first phase, the domain that matches the domain term is identified (notice that the domain is user-defined, as opposed to the common fixed domain hierarchy). In the second phase, the query term is dynamically evaluated within the domain. This approach, while promising, introduces two hard technical problems: how to dynamically identify a domain, and how to confine the dynamic search to this domain. Finding an optimal solution to these problems remains a far-reaching goal, however we propose here an experimental approach that performs well in many cases.
The general idea is to represent a domain by finding the central documents in it, and to approximate intra-domain search by exploring these pages dynamically. The problem of finding central documents in a domain has been researched extensively recently, but mostly in the context of general-purpose search. The common approach is to analyze the link topology and give documents a score that is independent of their relevance to specific queries and which represents how authoritative they are. We argue that this general approach is particularly effective for identifying central documents of a domain (as opposed to general-purpose topic search) since a domain is more likely to have such central nodes (which is not necessarily true for arbitrary terms), and domain names are typically less ambiguous than specific terms.
There are several approaches to finding important documents (that we refer to as "authorities" in the following, to reuse the term coined by Kleinberg ), and some are embodied on freely available Web search services, e.g., Google. We have chosen to take advantage of one of the implementations of Kleinberg's "Hubs and Authorities" approach, namely Clever , which "parasites" AltaVista to build the graph on which to compute authorities. Clever is one of the most advanced implementations of the Hubs and Authorities model and is conveniently located on the same IBM intranet as our Fetuccino prototype. In principle, however, any method for finding hubs and authorities as domain representatives could be used. Let us remind the reader that the "Hubs and Authorities" approach consists of identifying authoritative documents that are pointed to by many sites, and "hub" nodes that point to authoritative sites. This approach has had several variants and extensions (, ). It is also particularly effective for our purposes, since we have found that the best way to represent a domain for the intra-domain search phase is to select not only the top authorities, as one might think, but also top hubs. The former are likely to contain relevant domain information, and the latter are likely to point to pages with domain information. Since intra-domain search is effectively a dynamic exploration along the link topology, Hubs are equally important as domain seeds.
Figure 3: The Advanced Form in Fetuccino Alfredo
We have extended the Fetuccino light service presented in Section 2.3 to support the two phase approach introduced above. From a user-interface perspective, the main difference between the two versions of Fetuccino is that in this heavier version (code named "Fetuccino Alfredo"), the user is required to enter two query fields, a "domain query" field and a "focused query" field, instead of a single free-text entry. An example of the advanced form is shown in Figure 3.
Once the user clicks on "Create the Results Map" button, the server forwards the domain query to Clever, that computes a set of hubs and authorities (note that depending on the communication bandwidth between the Authority engine and the actual search server from which the graph is retrieved, this process might take up to a few minutes). The results are pushed back to the user's browser. See Figure 4 for an example of Clever output on the domain query "Chicago Bulls".
At the same time, the top authorities and the top hubs are fed as seeds to the
shark-directed crawler. The key point here is that the shark-directed search does not use
the same domain query, but a more focused query that does not in general include the
domain terms (so as not to bias the results, these should be implicit only in the dynamic
domain represented by the hubs and authorities). After the time limit is expired, the map
of augmented and filtered results is pushed back to the user's browser.
Figure 4: Results from Clever for the domain query "Chicago Bulls"
An example of such a map is given in Figure 5 below, where the top 3 results are highlighted in darker blue (and framed in red). The top one refers to "Bulls roster", the second to "Bulls transactions", and the third one (in the bottom right hand side) as well as the other nodes in a lighter blue shade, refer to "trades and transactions" but of other NBA teams - in the case of the 3d candidate : the LA clippers. This is due to the fact that, as explained before, the domain is not a manually crafted domain, and as such includes related information (other NBA teams) that does not perfectly match the "bulls" domain. An option would be to filter the domain by forcing each page to explicitly refer to the bulls, or re-specify the domain within the focused query. We believe though that for the sake of a better recall, it is better to show those results to the user, who gets additional clues from the relevance score (shade of blue) and the proximity to the domain seeds.
Figure 5:Fetuccino results for two phase query
"Chicago Bulls" and "trades and transactions"
We show below some comparative results on a set of queries which show the main advantages of using a two phase domain/focused query approach as compared to simply combining the two in a single query and submitting the query to a regular search service as well as authority-based search service.
Our first example is with the domain/focused queries: "Chicago Bulls/Trades and Transaction". We tried, a concatenated query "Chicago Bulls, trades and transaction" (in free-text rather than Boolean form) against Excite, Alta-Vista and Clever. All of Excite and Clever results were most dominated by results relevant to the "Chicago Bulls". For Clever this was to be expected as the hubs and authority model specifically states that it works best for "broad" queries. In any case, Clever did return indeed good Bulsl authorities, while Excite returned the official site of the bulls only as a 5th candidate, while a fan's page was ranked 3rd. Alta-Vista had stranger results. On the day we issued the query, we got on the top 10 results, the 5th and 6th candidates untitled "Chicago Bulls Transactions", but typically enough the 6th one was a broken link. The top 2 candidates referred to the bulls, the 3rd one was a non-official "trade and free agent form" (from geocities) not for the bulls specifically, and the remaining four referred to "J.A.C.K" ( Journal of Associated Construction Knowledge, Publications for the trades) most of which being broken links but in any case nothing to do with basketball. By comparison, the results obtained by Fetuccino, as shown in Figure 5, were much more precise. Thanks to the quality of the seeds or domain pages returned by clever, Fetuccino could find in their vicinity precise information about trades in the right domain.
We discovered that the two phase approach is particularly useful when focused queries have another meaning (as with the "trades" focused query in the previous example) in a given domain, or get a different perspective, as in the example below. In Figure 6, we see the results for the domain/focused query: "Sun Corporation/McLaren team". This map evidences very good results in a very short time span. The top candidate in the close neighborhood entitled "Accelerated Success", has as subtitle (when the page is visited) "McLaren racing team picks up speed with Sun". The good quality of the results here is due to the fact that 1. the top candidate returned by Clever was an excellent authority namely the Sun home page, and 2. the McLaren deal information was in the close neighborhood of the authority. In such a simple case, a user skilled in Web searches would have found the same information manually by going to the authority (that he should know) and doing two hops. Fetuccino automated these natural steps for the user and added a convenient interface where, for instance, all relevant pages can be printed in one click.
Figure 6: Fetuccino results for query "Sun Corporation/McLaren deal"
However, the two phase approach is not that successful when the authorities are not "hubby" enough or the "hubs" are not authoritative enough. Indeed, when trying the two phase query domain: "Intel Architecture" and focus: "Merced", we did not get any good results on Merced from the second stage focused search. When analyzing the reason of that failure we found out (by searching the Intel site) that documents pertinent to the Merced architecture were on an Intel developer site that was not among the retrieved authorities. The main Intel authority (www.intel.com) did point to that developer site, but it would have required too many "hops" for the dynamic crawling to find the relevant documents. In other words, when the authority seeds are too far from the relevant information, or the domain view is too large, the dynamic search looses its efficiency and relevant information is not found in a decent amount of time.
It is difficult to evaluate precisely the effectiveness of our two phase
search without adequate benchmarking, which is in itself a complicated issue for the Web
(while in the IR community, the TREC approach gives a good measure for evaluating recall
and precision). However, as a simple indication, we show here some precision results for
the 10 best candidates returned by a few Web Search Engines (namely Excite, and Infoseek,
and more recent ones like Clever and Google) as well as by Fetuccino, over the same set of
|Leonardo da Vinci/inventions||0||0||0||.1||0||.3|
career and employment opportunities
|Sun Corporation/McLaren team||0||.2||0||0||.1||.2|
1Clever top 5 authorities and top 5 hubs.2 Clever top 10 authorities.
These test cases having been designed experimentally (rather than formally over a
statistically significant sample, and using testing guidelines procedure), they should be
considered as a positive indicator of the effectiveness of the algorithm rather than as
formal results. As mentioned before, our results are dependent on the quality of Clever
results. The first conclusion we drew from those experiments that it might be worthwhile
to examine and compare various authority engines so as to get the best possible seeds.
The accelerated growth of the Web might cause traditional purely static search engines to become less accurate and less effective over the time, even if their indexing, retrieval, and storage techniques are likely to improve. Fetuccino addresses this shortcoming by augmenting static search services with text-based dynamic exploration around the vicinity of the search results, thereby discovering new relevant information and validating old information. The key to making an effective use of the limited exploration time, which is typically bounded by the (im)patience of the interactive user, is the use of text-based relevance check in order to control the exploration process itself, in addition to using it for accurate relevance checking for scoring purposes.
On top of the basic dynamic search mechanism, we added a novel capability: a user-defined domain search, followed by a dynamic focused search within the domain. Although our current solution is only a crude approximation of truly domain-specific search, it nevertheless produces promising results that could improve advanced static methods.
We believe that adding a dynamic search capability opens the door for many other future enhancements, of which domain-based search is only a starting point. For example, assuming that a standard interface for site search, like WAIS used to provide, would reappear, the identification of the domain could be much improved by dispatching the queries to various intra-site search engines, in a similar way that MetaCrawler , , dispatches queries to search services.
We thank Jon Kleinberg and Ron Fagin for useful discussions on Hubs and Authorities (and Ron Pinter for putting us in contact). We are also grateful to Prabhakar Raghavan for letting us use the Clever system. Finally, we are in debt to Dirk Nicol for hosting Mapuccino and Fetuccino on the IBM Corporate Java Site. This research was done while Dan Pelleg was an extern student at the IBM Haifa Research Laboratory.
|Issy Ben-Shaul is a faculty member in the department of Electrical Engineering at the Technion -- Israel Institute of Technology, and a consultant in the Information Retrieval and Organization Group at the IBM Haifa Research Lab. He received his BSc in Mathematics and Computer Science from Tel Aviv University in 1988, and his MS and PhD in Computer Science from Columbia University in 1991 and 1995, respectively. During 1995, before joining the Technion, Ben-Shaul was a research staff member at the IBM research laboratory in Haifa, and worked on applications and extensions of clustering technology to the Internet. He is leading the Distributed Systems Group and the associated software systems laboratory at the Technion. His research interests include distributed and mobile systems, software engineering, information retrieval, Web, advanced transactions, workflow management systems and electronic commerce. He has published over 30 papers in refereed journals and conference|
|Michael Herscovici is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel and belongs to the "Information Retrieval and Organization" Group. His research interests include Internet applications and parsing techniques. Mr. Herscovici received his B.Sc. in Computer Science from the Technion, Israel institute of technology in Haifa, in 1998. He joined IBM in 1997 and has since worked on the dedicated robot component of Mapuccino, a Web site mapping tool.|
|Michal Jacovi is a Research Staff Member at the IBM Haifa
Research Lab in Haifa, Israel, and belongs to the "Information Retrieval and
Organization" Group. Her research interests include Internet applications, user
interfaces, and visualization. She received her M.Sc. in Computer Science from the
Technion, Haifa, Israel, in 1993. Ms. Jacovi has joined IBM in 1993, and worked on several
projects involving user interfaces and Object Oriented, some of which have been published
in journals and conferences.
Since the emergence of Java, she has been involved in the conception and implementation of Mapuccino, a Web site mapping tool, written in Java, that is being integrated into several IBM products.
Yoelle S. Maarek is a Research Staff Member at the IBM
Haifa Research Lab in Haifa, Israel and manages the "Information Retrieval and
Organization" Group that counts about 15 members. Her research interests include
information retrieval, Internet applications, and software reuse. She graduated from
the "Ecole Nationale des Ponts et Chaussees", Paris, France, as well as received
her D.E.A (graduate degree) in Computer Science from Paris VI University in 1985.
|Dan Pelleg received his B.A. in 1995 and his MSc in 1998
from the Department of Computer Science, Technion, Haifa, Israel. His Master's thesis
topic was "Phylogeny Approximation via Semidefinite Programming". He is
currently a PhD candidate in the CS Dept at Carnegie-Mellon University.
His research interests include computational biology, combinatorial optimization and Web-based software agents.
During the summers of 1997 and 1998, Dan worked as an extern student in IBM Haifa Research Laboratory.
|Menachem Shtalhaim is a Research Staff Member at the IBM
Haifa Research Lab in Haifa, Israel and belongs to the "Information Retrieval and
His research interests include Internet applications, communication protocols and heuristic algorithms.
Mr. Shtalhaim joined IBM in 1993, and worked on several projects involving morphological analysis tools, Network Design and analysis tool (IBM product NetDA/2) and the AS400 logical file system layer.
In the past, Mr. Shtalhaim has worked on medical diagnostic systems. He is the author of the dedicated robot component of Mapuccino, a Web site mapping tool
|Vladimir Soroka is a Research Staff Member at the IBM
Haifa Research Lab in Haifa, Israel and belongs to the "Information Retrieval and
Organization" Group. His research interests include Internet applications and
Information organization. Mr. Soroka received his B.Sc. in Computer Science from the Technion, Israel Institute of Technology in Haifa, Israel in 1996. Before joining IBM, Mr. Soroka worked on Internet-based fax servers.
He joined IBM in 1998 and has since worked on various applications such as Mapuccino, a Web site mapping tool.
|Sigalit Ur is a Research Staff Member at the IBM Haifa
Research Lab in Haifa, Israel, working on Mapuccino, a Web site mapping tool, written in
Java, that is being integrated into several IBM products.
She received a Master of Science degree in Intelligent Systems from the University of Pittsburgh in 1993.
Before joining IBM, Ms. Ur was involved in projects in a wide variety of fields, including data processing, databases, cognitive science, multi-agent planning and image processing, some of which have been published in journals and conferences.
 K. Bharat, A. Broder, M. Henzinger, P. Kumar and S. Venkatasubramanian. "The Connectivity Server: Fast Access to Linkage Information on the Web". In the Proceedings of the Seventh International World Wide Web Conference, Brisbane, April 1998.
Alexa - Technology http://www.alexa.com/support/technology.html