The widespreading of XML as a standard for semi-structured documents on the Web opens up challenging opportunities for Web query languages. In this paper we introduce XML-GL, a graphical query language for XML documents. The use of a visual formalism for representing both the content of XML documents (and of their DTDs) and the syntax and semantics of queries enables an intuitive expression of queries, even when they are rather complex. XML-GL is inspired by G-log, a general purpose, logic-based language for querying structured and semi-structured data. The paper presents the basic capabilities of XML-GL through a sequence of examples of increasing complexity.
Keywords: XML, Query Languages, Graphical Queries
XML  is a recent recommendation of the WWW Consortium for a meta-language to define mark-ups for content publishing on the Web. The design goals of XML are driven by the half-a-decade experience of usage of HTML as a content description language, which has exposed several inadequacies:
XML addresses both problems by letting content producers define and use the set of tags that best mirrors the structure and conceptual properties of the content they want to publish.
The shift from HTML to XML brings a major change in the structure of Web information, which becomes more and more a collection of semi-structured objects, i.e., pieces of content for which at least a partial representation of structure (known as schema) is available. This evolution brings forth the necessity of novel languages for extracting information from XML content, much in the same way as traditional query languages (notably SQL) have been used for extracting information from structured data stored in databases.
As in database applications, the purpose of a query language is twofold:
XML-GL addresses both issues, by permitting the formulation of queries for extracting information from XML documents and for restructuring such information into novel XML documents.
The originality of XML-GL with respect to other proposals for querying XML documents, like XML-QL , XQuery , XQL [8,16], and XSL , is that queries are formulated visually, using a graph-based formalism close to the structure of XML documents (e.g., comparable to the visual representations of XML documents offered by XML authoring tools like the Near & Far XML editor). However, XML-GL is not a visual interface over a conventional, textual, query language, but a graph-based query language with both its syntax and semantics defined in terms of graph structures and operations [2,3,4,14].
Prior to introducing the features of XML-GL, we propose a set of requirements for the design of a query language over XML documents.
An XML document can be compliant to a Document Type Definition (DTD), that specifies the types of mark-up elements that can appear in the document, their attributes and containment relationships. If an XML document adheres to a DTD, it is said to be valid. If an XML document lacks a DTD but respects some syntactic rules for tag placement, it is said to be well formed.
For coherence with the visual nature of XML-GL, we introduce an explicit data model for XML documents, called XML-GDM (XML Graphical Data Model), which we use to represent both the expected structure of XML documents (i.e., their DTDs) and actual documents. XML-GDM syntax will be used also (with a few additional graphical notations) for writing XML-GL queries and for representing the DTD of their result, with the benefit of reducing to the minimum the notations that the user should learn to query XML documents.
Well-formed documents without a DTD can be queried in the same way as valid documents: however, since no information about the structure of the document is available at query formulation time, it is more likely that queries may not match exactly the structure of the document and result in empty answers.
Consider the DTD shown in Figure 1, which specifies the structure of documents containing book orders. For each order, the information about the consignee, the ordered items, the date, and possibly a contact address are given. Each item contains the information about the book, the quantity, and possibly the discount percentage. For each book, the ISBN code, the price, and possibly the title and the authors are known. A possible XML document conforming to this DTD is shown in Figure 2.
The XML-GDM data model consists of three concepts: objects, relationships, and properties.
For the sake of explanation, we classify XML content into four categories:
XML attributes are also classified as object-identifiers, if their type is ID, object-valued, if their type is IDREF or IDREFS, or printable attributes, otherwise.
The correspondence between an XML DTD and an XML document and an XML-GDM graph is established by the following rules3:
According to the above rules, the DTD of Figure 1 is represented in XML-GDM as shown in Figure 3.
XML-GDM supports the representation of actual XML documents with the same formalism as for DTDs, except that cardinality constraints and disjunctions need not be represented, since a particular element occurrence always has a specific set of sub-elements and a particular choice of alternatives in the representation. A piece of the instance of Figure 2 is represented in Figure 4.
XML-GL is a query language for XML-GDM data. An XML-GL query can be applied either to a single XML document or to a set of documents, e.g., those composing a Web site. The query produces a new XML document as the result. Thus, the execution of a query results in a transformation of the source XML document(s) into a new XML document. An XML-GL query consists of four parts:
Graphically, an XML-GL query is a pair of XML-GDM graphs, displayed side by side and separated by a vertical line; the left-side graph visually represents the extract and match parts, while the right-side graph conveys the clip and construct parts. This separation sharply evidences those concepts which are used to extract elements from the target documents and those concepts which are used to construct the result documents produced by the query. Indeed, the right-side graph represents the DTD of the result.
In the following sections, we progressively introduce the features of XML-GL by means of queries with increasingly complex structures. First, we will show simple queries that extract elements from target documents and produce result documents in different ways (extract-clip queries, Section 3.1); then, we show queries that apply filtering conditions to the extracted elements to be included in the result (extract-match-clip queries, Section 3.2); finally we will introduce queries that define arbitrarily structured result documents, composed of new elements, possibly intermixed with elements extracted from the target documents (extract-match-construct-clip queries, Section 3.3).
The simplest form of XML-query is the extract-clip, which extracts a portion of an XML document and produces as output a new document containing the extracted data.
The query of Figure 5 finds all the BOOK elements from a specified set of documents over the Web. Its result is shown in Figure 7.b.
In extract-clip queries, the left-side graph contains the extract part of the query. In the example, the extract part operates on the single target element book. More generally, the extract part of a query may contain several target elements, represented as root nodes of the left-side graph.
Target elements may optionally contain the indication of the URLs of the document or set of documents that should be used as input in order to evaluate the query. For convenience, URL in queries may contain wildcards; in the query of Figure 5, the string ``http://www.elet.polimi.it/ceri/ord*.xml'' inside object BOOK makes the query target only those XML documents in the ceri directory of host www.elet.polimi.it, whose name starts with the string ord.
The right-side graph expresses the clip part, which defines the DTD of the result document as an XML-GDM graph, out of the structure of the target elements mentioned in the extract part. When an object mentioned in the extract part should belong to the result of the query, it must be included also in the clip part. The correspondence between left-side and right-side objects is by name (as for object BOOK in the example of Figure 5); if this introduces ambiguity (e.g., for queries using the same elements multiple times in the extract part), then the one-to-one correspondence may be made explicit by drawing an edge that connects the corresponding elements of the two graphs.
The meaning of the one-to-one correspondence is that the result of the query will be constructed -- according to the structure dictated by the right-side graph -- using exactly those object instances which are selected by the extract part and are mentioned in the clip part. In the example of Figure 5, all books found in the target XML documents are used to build the result.
When an extracted element is used to build the result, the clip part must also specify which sub-elements should be retained and which should be discarded. To make the clip part more concise, the following shorthand notations are defined, represented in Figure 6:
The results produced by the above clip parts applied to the set of all books in the running example document are illustrated in Figure 7.
The match part extends the left-side graph of the query with the possibility of expressing a large class of selection predicates. These are described by means of a well-defined collection of graphical notations which enable the expression of existential conditions (e.g., the requirement that a sub-element exists), or predicates on element's properties (e.g., required values for attributes and PCDATA content); all predicates are implicitly in conjunctive form.
The condition of a query normally involves several sub-elements of the target elements in the extract part; these elements are evidenced in the left-side graph. This operation is eased by the presence of the XML-GDM representation of the DTD of the input document(s), which exactly gives the needed graphical representation of the elements' internal structure. The same notation is also used to specify the sub-elements to be kept in the clip part, as shown in the previous section. Thus, query construction can be easily supported by a "drag-and-drop" interface, starting from the construction of the graph in the left part of a query, and then proceeding to the right-side graph.
The query of Figure 8 finds orders containing the book titled "Introduction to XML", to be shipped to an address in Los Angeles, and presents such orders with their shipping and item information5.
The document produced as result of the previous query is the following:
<ORDER number=2> <SHIPTO> <FULLADDRESS><COMPANY>ASA</COMPANY><CITY>Los Angeles</CITY> <ADDRESSLINE>18 Harvard str.</ADDRESSLINE> </FULLADDRESS> </SHIPTO> <ITEM> <BOOK><ISBN>15536455</ISBN> <TITLE>Introduction to XML</TITLE> <PRICE>24.95</PRICE> <AUTHOR><FIRSTNAME>Charles</FIRSTNAME><LASTNAME>Porter</LASTNAME></AUTHOR> </BOOK> <QUANTITY>6</QUANTITY> <DISCOUNT>.40</DISCOUNT> </ITEM> <ITEM> <BOOK><ISBN>15532155</ISBN> <TITLE>Introduction to Internet</TITLE> <PRICE>22.50</PRICE> <AUTHOR><FIRSTNAME>Steve</FIRSTNAME><LASTNAME>Andrews</LASTNAME></AUTHOR> </BOOK> <QUANTITY>10</QUANTITY> <DISCOUNT>.42</DISCOUNT> </ITEM> </ORDER>
Note that all orders appearing in the result have both an address and (at least) one item satisfying the extract-match part.
The condition in the match part may also involve the application of boolean operators to attributes and PCDATA properties. To this end, the match part may use the comparison operators (>, <, >=, <=, = and <>) and the string operators (_ and %). The match part can also be used to write queries targeted to several elements, similar to select-join queries of SQL.
The query of Figure 9 finds all books written by an author with the same last name as a person whose name starts with 'S'.
The join condition on last names is expressed in the match part by expanding the graph of the BOOK element to show the inner AUTHOR element, and connecting the author's and person's last names; note that for both the AUTHOR and PERSON elements, lastname is not an XML attribute, but a piece of XML data.
Queries involving element identity and references between elements are easily represented, based on the treatment of element identity explained in Section 2. Element identity is defined when an element has an ID attribute and other elements refer to it using IDREF(S) attributes: in this case, a query may search for IDs that ``point'' to the same element, i.e., that have the same value, as demonstrated in the following example.
The query of Figure 10 finds the persons referenced as contact in an order.
The query extracts those orders containing a contact that includes a reference (order #2, in the running example), and pairs them to the person whose ID matches the customer attribute of the reference element. These persons are then included in the result in the clip part.
Element identity can also be used in join conditions: the following query exploits IDREF attributes to ``join'' information of orders.
The query of Figure 11 finds the orders shipped to persons who are also contacts in another order.
Note that in this case the extract-match part of the query contains two ORDER elements, and ambiguity is avoided by explicitly connecting only one of them to its counterpart in the clip part. This technique compares with the use of alias (with the keyword as) in SQL queries, in order to disambiguate multiple occurrences of the same table or column name by associating each of them to a different variable.
Both positive and negated conditions are admitted: to express that a condition is negative, this is drawn using dashed lines, as in the following example.
The query of Figure 12.a finds all books having a title, while the query of Figure 12.b finds the books with unknown title.
In the preceding examples, conditions in the match part always specified the name of the element to be extracted/matched: however, if a query requires to express a condition on a generic object, dummy nodes are used. They are represented as unlabeled nodes which are matched against ANY element.
The query of Figure 13
finds all the elements that contain a book and includes them
in the result with their PCDATA content and terminal
So far, XML-GL queries have produced very simple result documents, defined by a subset of the elements extracted from target documents in the extract part. However, XML-GL can be used to produce more sophisticated result documents which:
From a document processing point of view, the semantics of the construct part of XML-GL queries is similar to that of a transformation program that converts a tagged document into another one by means of pattern matching and rewriting, as proposed for instance in the DSSSL (Document Style Semantics and Specification) language . However, while DSSSL provides its transformation language within a stylesheet-based environment for rendering and processing SGML documents, the construct part of XML-GL is concerned with restructuring alone, cleanly separating the transformation from the presentation issues. This corresponds to the functional decomposition envisioned in recent proposals for XML-based Web application environments . Indeed, XML-GL construction can be considered as a complement to current capabilities of XSL (XML Style sheet Language) , which again is mainly concerned with XML documents rendering.
The simplest form of construction consists of embedding elements extracted in the extract-match part into new elements. Three types of embedding are possible:
Consider the three queries of Figure 14. They find all the existing persons that have an address; with element construction (a) one instance of the new element (called RESULT) is created for each person satisfying the given condition and contains the person's data according to the clip specification; with the list construction (b) a single element (also called RESULT) is created which contains the list of persons satisfying the match part, along with their nested sub-elements as specified by the clip specification; with the grouping list construction (c) the resulting persons are grouped by city and one element (named RESULT) is introduced for each group. Formally, CITY induces a partition of the occurrences of PERSON, where each distinct value of CITY is mapped to a distinct subset of persons.
Thus, the results of the three queries applied to the document of Figure 2 are the following:
|Query A: element construction||Query B: list construction||Query C: grouping list construction|
<RESULT> <PERSON id="C00001"> <FIRSTNAME>Robert</FIRSTNAME> <LASTNAME>Moore</LASTNAME> </PERSON> </RESULT> <RESULT> <PERSON id="C00002"> <FIRSTNAME>Tom</FIRSTNAME> <LASTNAME>Smith</LASTNAME> </PERSON> </RESULT> <RESULT> <PERSON id="C00003"> <FIRSTNAME>Steve</FIRSTNAME> <LASTNAME>Andrews</LASTNAME> </PERSON> </RESULT>
<RESULT> <PERSON id="C00001"> <FIRSTNAME>Robert</FIRSTNAME> <LASTNAME>Moore</LASTNAME> </PERSON> <PERSON id="C00002"> <FIRSTNAME>Tom</FIRSTNAME> <LASTNAME>Smith</LASTNAME> </PERSON> <PERSON id="C00003"> <FIRSTNAME>Steve</FIRSTNAME> <LASTNAME>Andrews</LASTNAME> </PERSON> </RESULT>
<RESULT> <PERSON id="C00001"> <FIRSTNAME>Robert</FIRSTNAME> <LASTNAME>Moore</LASTNAME> </PERSON> <PERSON id="C00002"> <FIRSTNAME>Tom</FIRSTNAME> <LASTNAME>Smith</LASTNAME> </PERSON> </RESULT> <RESULT> <PERSON id="C00003"> <FIRSTNAME>Steve</FIRSTNAME> <LASTNAME>Andrews</LASTNAME> </PERSON> </RESULT>
Figure 15 pictorially summarizes the three different construction primitives and how they build the result from the extracted elements. The first option lists all the selected elements, the second option builds a result element containing all the selected persons, and the third option presents them according to the partitioning produced by the grouping criterion.
The query of Figure 16 demonstrates the orthogonal combination of construction primitives; it is a variant of Figure 14.c: it groups persons by city and embeds these groups into a new element called RESULT containing also the name of the city.
The query applied to the document of Figure 2 returns the following result:
<RESULT> <CITY>Los Angeles</CITY> <PERSON id="C00001"> <FIRSTNAME>Robert</FIRSTNAME> <LASTNAME>Moore</LASTNAME> </PERSON> <PERSON id="C00002"> <FIRSTNAME>Tom</FIRSTNAME> <LASTNAME>Smith</LASTNAME> </PERSON> </RESULT> <RESULT> <CITY>San Francisco</CITY> <PERSON id="C00003"> <FIRSTNAME>Steve</FIRSTNAME> <LASTNAME>Andrews</LASTNAME> </PERSON> </RESULT>
XML-GL can also be used to restructure existing documents, e.g., by including elements of one document into another one or extending the elements of one document with information originating from related elements inside the same document.
The query of Figure 17 finds the orders containing a book whose author's first name and last name appear also in an element of type PERSON and produces a new element EXTORDER where the address is added to each author.
The result of this query applied on the document of Figure 2 is the following:
<EXTORDER> <ITEM> <BOOK> <AUTHOR> <FIRSTNAME>Steve</FIRSTNAME> <LASTNAME>Andrews</LASTNAME> <FULLADDRESS> <CITY>San Francisco</CITY> <ADDRESSLINE>15 Washington str. </ADDRESSLINE> </FULLADDRESS> </AUTHOR> </BOOK> </ITEM> </EXTORDER>
<EXTORDER> <ITEM> <BOOK> <AUTHOR> <FIRSTNAME>Steve</FIRSTNAME> <LASTNAME>Andrews</LASTNAME> <FULLADDRESS> <CITY>San Francisco</CITY> <ADDRESSLINE>15 Washington str. </ADDRESSLINE> </FULLADDRESS> </AUTHOR> </BOOK> </ITEM> </EXTORDER>
In the result, one element EXTORDER is constructed for each group of items belonging to the same order retrieved in the extract-match part. Items in the result are the same as those retrieved in the extract-match (as the by-name correspondence indicates), but for a fact: each AUTHOR sub-element is extended with the inclusion of the FULLADDRESS element coming from the corresponding PERSON object retrieved in the extract-match part.
XML-GL offers a rich set of additional features, like unnesting and nesting of XML objects, element ordering, data sorting, arithmetic functions, and aggregate functions. In  these additional features are presented, together with a description of the semantics of the language.
The huge amount of data published via the World Wide Web has led to a number of research efforts on techniques to index, query and restructure Web sites contents. In this section we provide a brief overview of related work on XML query languages and, more generally, on query languages for the Web (see also ).
A considerable amount of research has been made on how to complement keyword-based searching with database-style support for querying the Web. Several projects addressed this problem, and three main Web query languages have been proposed so far: Web3QL , WebSQL  and WebLog . The first two languages are modeled after standard SQL used for relational DBMSs, while the third retains the flavor of the Datalog language.
In the specific domain of XML documents, proposals for query languages are in their infancy. Several preliminary contributions and position papers are collected in . Among the discussed approaches, we review XML-QL , XQL  by Microsoft, Texcel, and webMethods, XQuery by Inso , and XQL  by Fujitsu.
The XML-QL language  has been submitted for evaluation to the WWW Consortium by a pool of researchers. XML-QL provides a textual syntax for writing queries that construct new XML documents from target documents. The expressive power of XML-QL is comparable to that of XML-GL, but the former has a different syntactic flavor based on the use of pattern-matching expressions and variables ranging over content and tag names to extract content from target documents and embed it into the result of a query.
The XML Query Language (XQL)  is a notation proposed by several companies for addressing and filtering the elements and text of XML documents. XQL is an extension to the XSL pattern syntax . The basic idea is to provide a syntax to locate nodes (elements and text) within an XML document, using a notation inspired by directory path expressions. XQL relies on path expressions, filters, and methods to achieve an effect similar to the extract-match part of an XML-GL query. Conversely, there is no counterpart in XQL for the construction of new documents, as provided by the clip-construct part of XML-GL queries.
XQuery  is a query language proposed by Inso Corporation for extracting information from XML documents. XQuery draws its syntax from the XPointer document linking language . XQuery provides a rich type system for representing XML content and defines the output of queries as locations, which can be sets of XML elements, attributes, or spans of text. A query consists of a sequence of steps: each step selects nodes, either in absolute terms or based on the output of the preceding step. Examples of steps are: ``select the element with the given id'' or ``select from among the direct children of the current location''. Sequences of steps are joined by the dot operator, like in OQL path expressions. XQuery most advanced features address the use of links in queries and of regular expressions to write order-sensitive queries.
XQL  by Fujitsu takes a different approach to XML document querying, by proposing a syntax which extends well-known database query languages (SQL and OQL) to address the features of XML data. XQL has a select-from-where construct extended with tag variables, path expressions, and URL specification. XQL also includes primitive for the construction of output documents (inclusive of grouping) comparable to the construct-clip part of XML-GL queries.
XML-GL is a sophisticated, but intuitive, visual language for querying XML data sources. It draws its unique features from an original combination of orthogonal, natural primitives for visualizing DTDs and documents, extracting their content, producing new content from extracted data, and formatting query results in complex ways. The use of a visual interface and language for querying XML-based Web documents seems very appealing.
Our research activity will concentrate on the following directions. We will consolidate the language and design a textual version with equivalent expressive power; we have great interest in using a query language jointly designed by the Web community, but a consensual language has to emerge and the language should support our graphical constructs in a natural way. We will also address the requirements not currently satisfied by our proposal, such as queries over arbitrarily linked documents and metadata, and flexible query interpretation and expansion for non-exact document matching. Finally, we will concentrate on the deployment of the language within a Web-based visual environment, by studying an effective query interface supporting the automatic display of DTDs and/or documents and a collection of graphical primitives for clipping and dragging schema elements and for incrementally constructing the query graphs.