A Comparison of Schemas for Video Metadata Representation


Jane Hunter, CITEC, 317 Edward St Brisbane, Qld, 4001, Australia. Phone +617 33654310, Fax +617 33654311 jane@dstc.edu.au
Liz Armstrong, DSTC, Level 7, GP South, Uni of Qld, Qld, 4072, Australia. Phone +617 33654310, Fax +617 33654311 liz@dstc.edu.au


Abstract

To enable the resource discovery of audiovisual documents over the WWW, it will be necessary to define content description standards or metadata standards for complex, multi-layered, time-dependent, information-rich audiovisual data streams. This is the primary goal of the emerging MPEG-7 standard, the "Multimedia Content Description Interface" [1], under development by the MPEG group. In the past, considerable effort has gone into generating descriptors and description schemes for video indexing, but comparatively little research has been done on schemas capable of defining the structure, content and semantics of video documents and enabling validation and higher levels of automated content checking. This paper compares the capabilities of the RDF Schema, Extensible Markup Language (XML) Document Type Definitions (DTDs), Document Content Description (DCD) and Schema for Object-Oriented XML (SOX) for supporting and validating hierarchical video descriptions based on Dublin Core, MPEG-7 and a specific hierarchical structure. Finally, this paper proposes a hybrid schema, based on features from each of these schemas, which will satisfy the MPEG-7 Description Definition Language (DDL) requirements.


Keywords

Video, Metadata, Schema, Dublin Core, MPEG-7


1. Introduction

To enable the resource discovery of audiovisual documents over the WWW, it will be necessary to define content description standards or metadata standards for complex, multi-layered, time-dependent, information-rich data streams. This is the primary goal of the emerging MPEG-7 standard, the "Multimedia Content Description Interface" [1], under development by the MPEG group.

A number of papers have considered the application of Dublin Core (DC) and the Resource Description Framework (RDF) to video indexing [2, 3, 4, 5]. An example of such an application is described briefly below. However, very little work has been done on defining schemas which are capable of actually validating and constraining video descriptions and their associated data models. Such schemas will be necessary for the development of cost-efficient, user-friendly, semi-automatic metadata generation and editing tools for video. Such a schema would also provide a solution for the Description Definition Language (DDL) component of the MPEG-7 requirements.

This paper first briefly presents a video description scheme based on Dublin Core and MPEG-7. From this description format, a list of schema requirements is derived. The paper then compares the ability of a number of existing schemas and schema proposals, including the RDF Schema, XML DTDs, DCD and SOX, to support descriptions of hierarchical video structures. Examples of schema definitions are given to illustrate their capabilities.

Finally this paper proposes a hybrid schema based on specific features from each of these schemas and schema proposals which would satisfy the MPEG-7 Description Definition Language (DDL) requirements.


2. Proposed Video Description Scheme

Dublin Core was designed specifically for generating metadata to facilitate the resource discovery of textual documents. Although a number of workshops have been held to discuss the applicability of Dublin Core to non-textual documents such as images, sound and moving images, they have primarily focused on extensions to the 15 core elements through the use of subelements and schemes specific to audiovisual data, to describe bibliographic-type information rather than the actual content.

It has been shown [2] that it is possible to describe both the structure and fine-grained details of video content by using the fifteen Dublin Core elements plus qualifiers and encoding this within RDF. This "pure Dublin Core" approach provides multiple levels of descriptive information. At the top level, the 15 basic Dublin Core elements can be used to describe the bibliographic-type information about the complete document (e.g. Title, Author, Contributor, Date etc.). This enables non-specialist inter-disciplinary searching, independent of media type. Extensions or qualifiers to specific DC elements (Type, Description, Relation, Coverage) can be applied at the lower levels (scenes, shots, frames) to provide fine-grained, discipline- and media-specific searching (e.g. Description.Camera.Angle). The disadvantage of this approach is that the semantic refinement of Dublin Core through the use of qualifiers eventually leads to a loss of semantic interoperability.

The alternative is a "hybrid" approach in which RDF (or some other framework) is used to combine both simple unqualified Dublin Core and MPEG-7 descriptors within a single description container. Dublin Core can be used for generic media-independent search and retrieval while MPEG-7 can be used for object-specific fine-grained queries. Our future research will compare and evaluate these two approaches for multimedia resource discovery and determine the best balance between semantic interoperability, extensibility and modularity. At this stage, we do not know the specific attributes of each level; we can only assume that each structural component will possess both a set of Dublin Core attributes and a set of MPEG-7 attributes, as illustrated in Figure 1 below.

For example, if DC.Type = "Image.Moving.TV.News.Scene" then valid descriptors will include both the DC simple elements plus MPEG-7 descriptors such as script, transcript, editlist, keyframe etc. If DC.Type = "Image.Moving.TV.News.Scene.Shot" then valid descriptors will include both the DC elements plus keyframe, camera_distance, camera_angle, camera_motion, opening_transition, closing_transition. If DC.Type = "Image.Moving.TV.News.Scene.Shot.Frame" then the valid descriptors will be the DC elements plus colour_histogram.
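The rule above can be sketched as a simple lookup. This is only an illustration under stated assumptions: the names DC_ELEMENTS, LEVEL_DESCRIPTORS and valid_descriptors are hypothetical, and the Scene entry lists only the descriptors named in the text, which ends with "etc.".

```python
# The 15 Dublin Core elements, which apply at every structural level.
DC_ELEMENTS = frozenset({
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
})

# Level-specific descriptors named in the text (hypothetical table;
# the Scene list is partial because the text ends it with "etc.").
LEVEL_DESCRIPTORS = {
    "Image.Moving.TV.News.Scene": frozenset({
        "script", "transcript", "editlist", "keyframe",
    }),
    "Image.Moving.TV.News.Scene.Shot": frozenset({
        "keyframe", "camera_distance", "camera_angle", "camera_motion",
        "opening_transition", "closing_transition",
    }),
    "Image.Moving.TV.News.Scene.Shot.Frame": frozenset({
        "colour_histogram",
    }),
}


def valid_descriptors(dc_type: str) -> frozenset:
    """Return the DC elements plus any descriptors specific to dc_type."""
    return DC_ELEMENTS | LEVEL_DESCRIPTORS.get(dc_type, frozenset())
```

An unknown DC.Type value falls back to the Dublin Core elements alone, which mirrors the idea that the DC attributes are available at every level.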

Figure 1 shows the logical structure, the structural components and their associated Dublin Core attributes and some assumed MPEG-7 attributes for the proposed video description scheme.

Figure 1: Multilayered Hierarchical Structure and Attributes of Video

3. Video Metadata Schema Requirements

In order to represent the video structure and Dublin Core descriptors outlined in Figure 1, a suitable schema must be able to support the following:

These requirements are similar to and compatible with the DDL requirements listed in section 4.1.1 of the MPEG-7 Requirements Document [7].


4. Resource Description Framework (RDF) Schema

The Resource Description Framework (RDF) enables interoperability between applications which exchange machine-understandable information on the Web. A model for representing metadata, as well as an XML-based syntax for encoding RDF, has been defined in the RDF Model and Syntax Specification document [8].

RDF is based on a resource and property data model system. A collection of classes (typically authored for a specific purpose or domain) and the definition of their properties (attributes) and corresponding semantics represent an RDF schema. A schema defines not only the properties of the resource or class (Title, Author, Subject, Size, Color etc.) but also may define the kinds of resources being described (books, webpages, people, companies, etc.). The details of RDF schemas have been defined in the RDF Schema Specification document [9].

Classes are organized in a hierarchy, and offer extensibility through subclass refinement. This way, in order to create a schema slightly different from an existing one, one can just provide incremental modifications to the base schema. Through the sharability of schemas RDF will support the reusability of metadata definitions. Due to RDF's incremental extensibility, agents processing metadata will be able to trace the origins of schemes they are unfamiliar with back to known schemes, and perform meaningful actions on metadata they weren't originally designed to process. The sharability and extensibility of RDF also allows metadata authors to use multiple inheritance to "mix" definitions, to provide multiple views to their data, taking advantage of work done by others. The XML namespace mechanism serves to identify different RDF Schemas.

RDF schemas can be compared to XML Document Type Definitions (DTDs). Unlike an XML DTD, which gives specific constraints on the syntactic structure of a document, an RDF schema provides semantic information about the interpretation of the statements given in an RDF data model. Given its goals, RDF appears to be the ideal approach for supporting descriptors from multiple description schemes simultaneously, as required by the MPEG-7 DDL.


4.1 Example of a Suitable RDF Schema

This section describes an RDF schema definition that attempts to map to the diagram in Figure 1 and support the requirements listed above.

Since we want the DC simple attributes to be applicable to every component or layer, videos, sequences, scenes, shots, frames and objects are all sub-classes of a top level document class which possesses the DC attributes. In addition each sub-class has its own additional descriptive properties or attributes which will correspond to MPEG-7 descriptors when they become available.

<rdf:RDF
	xmlns:rdf = "http://www.w3.org/TR/WD-rdf-syntax#"
	xmlns:rdfs= "http://www.w3.org/TR/WD-rdf-schema#"
	xmlns:dc= "http://purl.org/metadata/dublin_core#">

<rdfs:Class ID="MM_document">
<rdfs:comment>Class for representing a generic multimedia document</rdfs:comment>
</rdfs:Class>

<rdfs:comment>Define all of the DC elements for MM_document </rdfs:comment>

<rdf:PropertyType ID="Title">
<rdfs:comment>This is the DC Title element </rdfs:comment>
<rdfs:domain rdf:resource="#MM_document"/>
<rdfs:range rdf:resource="http://purl.org/metadata/dublin_core#Title"/>
</rdf:PropertyType>

<rdf:PropertyType ID="Creator">
<rdfs:comment>This is the DC Creator element </rdfs:comment>
<rdfs:domain rdf:resource="#MM_document"/>
<rdfs:range rdf:resource="http://purl.org/metadata/dublin_core#Creator"/>
</rdf:PropertyType>
.
etc.
.
<rdfs:Class ID="Video">
<rdfs:comment>Class for representing a video document. It is a subclass of MM_document</rdfs:comment>
<rdfs:subClassOf rdf:resource="#MM_document"/>
</rdfs:Class>

<rdfs:Class ID="Sequence">
<rdfs:comment>Class for representing a sequence from a video document. It is a subclass of MM_document</rdfs:comment>
<rdfs:subClassOf rdf:resource="#MM_document"/>
</rdfs:Class>

<rdfs:Class ID="Scene">
<rdfs:comment>Class for representing a scene. It is a subclass of MM_document</rdfs:comment>
<rdfs:subClassOf rdf:resource="#MM_document"/>
</rdfs:Class>

<rdfs:Class ID="Shot">
<rdfs:comment>Class representing a shot. It is a subclass of MM_document</rdfs:comment>
<rdfs:subClassOf rdf:resource="#MM_document"/>
</rdfs:Class>

<rdfs:Class ID="Frame">
<rdfs:comment>Represents a single frame. It is a subclass of MM_document</rdfs:comment>
<rdfs:subClassOf rdf:resource="#MM_document"/>
</rdfs:Class>

<rdfs:Class ID="Object">
<rdfs:comment>Represents an object within a frame. It is a subclass of MM_document</rdfs:comment>
<rdfs:subClassOf rdf:resource="#MM_document"/>
</rdfs:Class>
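The way the DC properties reach every structural level through subclassing can be illustrated with a small sketch. It is not an RDF processor: the tables SUBCLASS_OF and PROPERTY_DOMAINS and the function applies_to are hypothetical names that simply mirror the class and property declarations in this section.

```python
# Class hierarchy taken from the rdfs:subClassOf statements in this section.
SUBCLASS_OF = {
    "Video": "MM_document",
    "Sequence": "MM_document",
    "Scene": "MM_document",
    "Shot": "MM_document",
    "Frame": "MM_document",
    "Object": "MM_document",
}

# Domain of a few properties (illustrative subset only).
PROPERTY_DOMAINS = {
    "Title": {"MM_document"},
    "Creator": {"MM_document"},
    "contains_sequences": {"Video"},
}


def ancestors(cls: str) -> list:
    """Return cls followed by its superclasses, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def applies_to(prop: str, cls: str) -> bool:
    """True if prop has cls, or one of its superclasses, as a domain."""
    domains = PROPERTY_DOMAINS.get(prop, set())
    return any(c in domains for c in ancestors(cls))
```

Under this reading, Title applies to a Shot because Shot is a subclass of MM_document, whereas contains_sequences applies only to Video.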

One problem with RDF is the difficulty of creating a generic property, such as contains, through which the hierarchical structure can be defined, i.e. videos contain sequences, which contain scenes, which contain shots, which contain frames, which contain objects and actors. If a property contains is created for #Video, how can it also be applied to #Sequence, #Scene and #Shot? Because each property is restricted to a single range, a generic relationship such as contains cannot be used. Instead, a separate property must be defined for each domain-range pair, which is tedious and repetitive. The lack of class-specific constraints on the domain and range of properties is a major limitation of RDF, particularly when applied to complex multilayered documents in which constraints must be specified on the structural, spatial, temporal and conceptual relationships between components.


<rdf:PropertyType ID="contains_sequences">
<rdfs:comment> Property related to a video asset stating that a video consists of a number of sequences. </rdfs:comment>
<rdfs:domain rdf:resource="#Video"/>
<rdfs:range rdf:resource="#Sequence"/>
</rdf:PropertyType>

<rdf:PropertyType ID="contains_scenes">
<rdfs:comment> Property related to a sequence asset stating that a sequence consists of a number of scenes. </rdfs:comment>
<rdfs:domain rdf:resource="#Sequence"/>
<rdfs:range rdf:resource="#Scene"/>
</rdf:PropertyType>

<rdf:PropertyType ID="contains_shots">
<rdfs:comment> Property related to a scene asset stating that a scene consists of a number of shots. </rdfs:comment>
<rdfs:domain rdf:resource="#Scene"/>
<rdfs:range rdf:resource="#Shot"/>
</rdf:PropertyType>

<rdf:PropertyType ID="contains_frames">
<rdfs:comment> Property related to a shot asset stating that a shot consists of a number of frames. </rdfs:comment>
<rdfs:domain rdf:resource="#Shot"/>
<rdfs:range rdf:resource="#Frame"/>
</rdf:PropertyType>

<rdf:PropertyType ID="contains_objects">
<rdfs:comment> Property related to a frame asset stating that a frame consists of a number of objects. </rdfs:comment>
<rdfs:domain rdf:resource="#Frame"/>
<rdfs:range rdf:resource="#Object"/>
</rdf:PropertyType>
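The repetition in these definitions is mechanical: one property is needed for each pair of adjacent levels. The sketch below, with the hypothetical names LEVELS, contains_properties and may_contain, derives the same domain-range pairs from an ordered list of levels. It is meant only to illustrate the pattern and is not part of any RDF toolkit.

```python
# Structural levels in containment order, as used in the schema above.
LEVELS = ["Video", "Sequence", "Scene", "Shot", "Frame", "Object"]


def contains_properties(levels):
    """Yield one (property name, domain, range) triple per adjacent pair."""
    return [
        (f"contains_{child.lower()}s", parent, child)
        for parent, child in zip(levels, levels[1:])
    ]


def may_contain(parent: str, child: str, levels=LEVELS) -> bool:
    """True if some contains_* property links parent directly to child."""
    return any(
        (d, r) == (parent, child) for _, d, r in contains_properties(levels)
    )
```

For the six levels listed, five separate properties result, which is the tedium that a single generic contains property would avoid.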

Another problem is the limited data typing within RDF. There are three ways of specifying data types within RDF:

Below is an example of the RDF Schema code defining some of the scene, shot, frame and object properties. It illustrates the three data typing methods available.


<rdf:PropertyType ID="startTime">
<rdfs:domain rdf:resource="#Scene"/>
<rdfs:domain rdf:resource="#Shot"/>
<rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Time"/>
</rdf:PropertyType>

<rdf:PropertyType ID="keyFrame">
<rdfs:domain rdf:resource="#Scene"/>
<rdfs:domain rdf:resource="#Shot"/>
<rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Image"/>
</rdf:PropertyType>

<rdf:PropertyType ID="openTrans">
<rdfs:domain rdf:resource="#Shot"/>
<rdfs:range rdf:resource="#Transitions"/>
</rdf:PropertyType>

<rdf:PropertyType ID="closeTrans">
<rdfs:domain rdf:resource="#Shot"/>
<rdfs:range rdf:resource="#Transitions"/>
</rdf:PropertyType>

<rdfs:Class ID="Transitions"/>
<Transitions ID="Cut"/>
<Transitions ID="Fade"/>
<Transitions ID="Wipe"/>
<Transitions ID="Dissolve"/>

<rdf:PropertyType ID="position">
<rdfs:domain rdf:resource="#Object"/>
<rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Point"/>
</rdf:PropertyType>

<rdf:PropertyType ID="shape">
<rdfs:domain rdf:resource="#Object"/>
<rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Polygon"/>
</rdf:PropertyType>

<rdf:PropertyType ID="colorHistogram">
<rdfs:domain rdf:resource="#Frame"/>
<rdfs:domain rdf:resource="#Object"/>
<rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Histogram"/>
</rdf:PropertyType>


4.2 Advantages of the RDF Schema for Video Metadata

RDF Schemas, within the context of this application, have the following advantages:


4.3 Limitations of the RDF Schema for Video Metadata

RDF Schema has the following problems or limitations:


5. XML DTDs

Extensible Markup Language (XML) is a subset of SGML for describing documents, and XML Document Type Definitions (DTDs) define the structure that a class of XML documents must follow. XML was developed by the XML Working Group, formed under the auspices of the World Wide Web Consortium (W3C) in 1996. The complete XML specification is available from the W3C [12].

Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures must nest properly.

The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute-value pairs with its logical structures. XML provides the document type declaration, to define constraints on the logical structure and to support the use of predefined storage units. An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it. Document type declarations are made in a Document Type Definition (DTD) file. The DTD file then contains a formal definition of a particular type of document outlining the element names and the structure of the document.


5.1 An Example of an XML DTD for Video Documents

The structure is defined in the element definitions at the top of the DTD. Each element has a set of associated attributes. All elements have an ID attribute plus the DC attributes. In addition, sequences, scenes and shots also have a set of time attributes (begin, end, dur). Each element also has its own set of level-specific attributes (which will correspond to the MPEG-7 descriptors when they become available).

<?xml version="1.0"?>

<!DOCTYPE videodoc [

<!-- hierarchical structure of videodoc -->
<!ELEMENT videodoc (sequence*)>
<!ELEMENT sequence (scene*)>
<!ELEMENT scene (shot*)>
<!ELEMENT shot (frame*)>
<!ELEMENT frame (object*)>
<!ELEMENT object (object*)>

<!-- ID attribute for every element -->
<!ENTITY % id_attr "id ID #IMPLIED">


<!-- Set of Dublin Core Attributes -->
<!ENTITY % dc_attr "
        Title    CDATA #IMPLIED
        Creator  CDATA #IMPLIED
        Subject  CDATA #IMPLIED
        Description CDATA #IMPLIED
        Publisher   CDATA #IMPLIED
        Contributor CDATA #IMPLIED
        Date        CDATA #IMPLIED
        Type        CDATA #IMPLIED
        Format      CDATA #IMPLIED
        Identifier  CDATA #IMPLIED
        Source      CDATA #IMPLIED
        Language    CDATA #IMPLIED
        Relation    CDATA #IMPLIED
        Coverage    CDATA #IMPLIED
        Rights      CDATA #IMPLIED">

<!ENTITY % scene_attr "
        Transcript   CDATA #IMPLIED
        Script       CDATA #IMPLIED
        EditList     CDATA #IMPLIED
        Keyframe     CDATA #IMPLIED
        Locale       CDATA #IMPLIED
        Cast         CDATA #IMPLIED
        Objects      CDATA #IMPLIED">

<!ENTITY % shot_attr "
        Keyframe     CDATA #IMPLIED
        CameraDist   NMTOKEN #IMPLIED
        CameraAngle  NMTOKEN #IMPLIED
        CameraMotion NMTOKEN #IMPLIED
        Lighting     NMTOKEN #IMPLIED
        OpenTrans    NMTOKEN #IMPLIED
        CloseTrans   NMTOKEN #IMPLIED">

<!ENTITY % frame_attr "
        Image              CDATA #IMPLIED
        Timestamp          CDATA #IMPLIED
        ColourText         NMTOKEN #IMPLIED
        ColourHistogram    CDATA #IMPLIED
        Texture            CDATA #IMPLIED
        Annotation         CDATA #IMPLIED
        Anno_Position      CDATA #IMPLIED">

<!ENTITY % object_attr "
        Position           CDATA #IMPLIED 
        Shape              CDATA #IMPLIED
        Trajectory         CDATA #IMPLIED
        Speed              CDATA #IMPLIED
        ColourText         NMTOKEN  #IMPLIED
        ColourHistogram    CDATA #IMPLIED
        Texture            CDATA #IMPLIED
        Volume             CDATA #IMPLIED
        Annotation         CDATA #IMPLIED
        Anno_Position      CDATA #IMPLIED">

<!ENTITY % time_attr "
     begin CDATA #IMPLIED
     end   CDATA #IMPLIED
     dur   CDATA #IMPLIED">

<!ATTLIST videodoc
    %id_attr;
    %dc_attr;>

<!ATTLIST sequence
      %id_attr;
      %dc_attr;
      %time_attr;>

<!ATTLIST scene
      %id_attr;
      %dc_attr;
      %scene_attr;
      %time_attr;>

<!ATTLIST shot
      %id_attr;
      %dc_attr;
      %shot_attr;
      %time_attr;>

<!ATTLIST frame
      %id_attr;
      %dc_attr;
      %frame_attr;>

<!ATTLIST object
      %id_attr;
      %dc_attr;
      %object_attr;>

]>
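To make the nesting rules of this DTD concrete, the following sketch checks an instance document against the same parent-child structure. Python's standard xml.etree.ElementTree module does not validate against DTDs, so the content models are restated by hand in the hypothetical table ALLOWED_CHILDREN. The sample documents GOOD and BAD and the function nesting_errors are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Content models restated from the <!ELEMENT ...> declarations above.
ALLOWED_CHILDREN = {
    "videodoc": {"sequence"},
    "sequence": {"scene"},
    "scene": {"shot"},
    "shot": {"frame"},
    "frame": {"object"},
    "object": {"object"},
}


def nesting_errors(xml_text: str) -> list:
    """Return (parent, child) pairs that violate the element hierarchy."""
    root = ET.fromstring(xml_text)
    errors = []
    if root.tag != "videodoc":
        errors.append((None, root.tag))
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                errors.append((parent.tag, child.tag))
    return errors


# Hypothetical instance documents using attributes declared in the DTD.
GOOD = (
    '<videodoc Title="Evening News">'
    '<sequence id="seq1" begin="0" dur="60">'
    '<scene id="sc1"><shot id="sh1" CameraAngle="low">'
    '<frame id="f1"><object id="o1"/></frame>'
    '</shot></scene></sequence></videodoc>'
)
BAD = "<videodoc><shot/></videodoc>"
```

A validating XML parser would perform this check, and the attribute checks, directly from the DTD. The sketch only shows what the element declarations constrain.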


5.2 Advantages of XML DTDs for Video Metadata


5.3 Disadvantages of XML DTDs for Video Metadata


6. Document Content Description (DCD) for XML

The Document Content Description (DCD) [15] facility for XML is an RDF vocabulary designed for describing constraints to be applied to the structure and content of XML documents. It consists of a set of properties used to constrain the types of elements and names of attributes that may appear in an XML document, the contents of the elements and the values of the attributes. It was designed to provide semantics over and above the purely syntactical XML DTDs. It was also designed to be conformant with the RDF Model and Syntax Specification (with some simplifications). DCD also incorporates a subset of an earlier submission to W3C, the XML-Data Submission [16].

The introduction to the XML-Data Submission says that it "describes an XML vocabulary for schemas, that is, for defining and documenting object classes. It can be used for classes which are strictly syntactic (for example, XML) or those which indicate concepts and relations among concepts (as used in relational databases, KR graphs and RDF). The former are called 'syntactic schemas;' the latter 'conceptual schemas'." Thus, XML-Data and DCD add object-oriented and data modelling concepts such as class inheritance to purely syntactic schemas such as XML DTDs.

DCD Schemas are based on elements and attributes. Elements correspond to RDF property types. DCD declarations constrain the content and attributes of elements in document instances, by assigning properties to objects of type ElementDef and AttributeDef.


6.1 Example of a DCD Schema

The DCD Schema below is based on the following assumptions:

<DCD
   xmlns:DC="http://purl.org/metadata/dublin_core#"
   xmlns:CDT="http://www.w3.org/TR/complex_datatypes#">

   <?DCD syntax="explicit"?>

   <Description>Example of a Video Document DCD</Description>
   <Namespace>http://www.dstc.edu.au/schemas/videodcd</Namespace>

   <ElementDef Type="videodoc" Model="Elements" Root="True">
      <Description>A video document structure.</Description>
      <Group RDF:Order="Seq">
         <Element>dc_values</Element>
         <Group Occurs="ZeroOrMore" RDF:Order="Seq">
             <Element>sequence</Element>
         </Group>
      </Group>
   </ElementDef>

   <ElementDef Type="sequence" Model="Elements">
      <Description>Description of a video sequence element</Description>
      <AttributeDef Name="seqID" Occurs="Required"/>
      <Group RDF:Order="Seq">
         <Element>dc_values</Element>         
         <Element>time_attribs</Element>
         <Group Occurs="ZeroOrMore" RDF:Order="Seq">
             <Element>scene</Element>
         </Group>          
      </Group>
   </ElementDef>

   <ElementDef Type="scene" Model="Elements">
      <Description>Description of a video scene element</Description>
      <AttributeDef Name="sceneID" Occurs="Required"/>
      <Group RDF:Order="Seq">
         <Element>dc_values</Element>
         <Element>time_attribs</Element>
         <Element>transcript</Element>  
         <Element>keyframe</Element>  
         <Group Occurs="ZeroOrMore" RDF:Order="Seq">
             <Element>shot</Element>
         </Group>          
      </Group>
   </ElementDef>

   <ElementDef Type="shot" Model="Elements">
      <Description>Description of a video shot element</Description>
      <AttributeDef Name="shotID" Occurs="Required"/>
      <Group RDF:Order="Seq">
         <Element>dc_values</Element>
         <Element>time_attribs</Element>
         <Element>camera_distance</Element>
         <Element>camera_angle</Element>
         <Element>camera_motion</Element>
         <Element>lighting</Element>
         <Element>open_transition</Element>
         <Element>close_transition</Element>
         <Group Occurs="ZeroOrMore" RDF:Order="Seq">
             <Element>frame</Element>
         </Group>          
      </Group>
   </ElementDef>

   <ElementDef Type="frame" Model="Elements">
      <Description>Description of a video frame element</Description>
      <AttributeDef Name="frameID" Occurs="Required"/>
      <Group RDF:Order="Seq">
         <Element>dc_values</Element>
         <Element>timestamp</Element>
         <Element>CDT:colourhistogram</Element>
         <Element>CDT:texture</Element>
         <Element>annotation</Element>
         <Element>CDT:anno_position</Element>
         <Group Occurs="ZeroOrMore" RDF:Order="Seq">
             <Element>object</Element>
         </Group>          
      </Group>
   </ElementDef>

   <ElementDef Type="object" Model="Elements">
      <Description>Description of a video object/actor element</Description>
      <AttributeDef Name="objectID" Occurs="Required"/>
      <Group RDF:Order="Seq">
         <Element>dc_values</Element>
         <Element>CDT:position</Element>
         <Element>CDT:shape</Element>
         <Element>CDT:colourhistogram</Element>
         <Element>CDT:texture</Element>
         <Element>CDT:trajectory</Element>
         <Element>annotation</Element>
         <Element>CDT:anno_position</Element>
         <Group Occurs="ZeroOrMore" RDF:Order="Seq">
             <Element>object</Element>
         </Group>          
      </Group>
   </ElementDef>

   <ElementDef Type="dc_values" Model="Elements">
      <Description>List of Dublin Core Elements</Description>
      <Group RDF:Order="Seq">
         <Element>DC:Title</Element>
         <Element>DC:Creator</Element>
         <Element>DC:Subject</Element>
         ......
      </Group>
   </ElementDef>

   <ElementDef Type="time_attribs" Model="Elements">
      <Group RDF:Order="Seq">
         <Element>start_time</Element>
         <Element>end_time</Element>
         <Element>duration</Element>
         ......
      </Group>
   </ElementDef> 

   <ElementDef Type="transcript" Model="Data" Datatype="string">
   </ElementDef>
   <ElementDef Type="keyframe" Model="Data" Datatype="uri">
   </ElementDef>

   <ElementDef Type="camera_distance" Model="Data" Datatype="enumeration">
      <Values>close-up medium-shot long-shot</Values>
   </ElementDef>

   <ElementDef Type="camera_angle" Model="Data" Datatype="enumeration">
      <Values>low eye-level high</Values>
   </ElementDef>
  
   <ElementDef Type="open_transition" Model="Data" Datatype="enumeration">
      <Values>cut fade wipe dissolve</Values>
   </ElementDef>

   <ElementDef Type="annotation" Model="Data" Datatype="string">
   </ElementDef>


</DCD>


6.2 Advantages of DCD for Video Metadata


6.3 Disadvantages of DCD for Video Metadata


7. Schema for Object-Oriented XML (SOX)

Schema for Object-Oriented XML (SOX) [17] provides a facility for defining the structure, content and semantics of XML documents to enable XML validation and automated content checking.

SOX provides an alternative to XML DTDs for modeling markup relationships. The introduction to the SOX specification says that it provides the following advantages over XML DTDs:

SOX supports three varieties of datatypes: scalar datatypes, enumerated datatypes and format datatypes. Scalar datatypes are derived from the basic number datatype, and support specification of the number of digits and decimal places, minimum and maximum value range, and a mask. An enumerated datatype may be derived from any of the intrinsic datatypes, and may specify an enumeration of valid values. A format datatype may be derived from any of the intrinsic datatypes, and must specify a mask.
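The three kinds of check can be sketched as follows. This is an illustration only and does not use any SOX processor: the function names are hypothetical, and the reading of a mask in which each "#" stands for exactly one digit and every other character is literal is an assumption rather than a statement of the SOX mask rules.

```python
import re


def in_scalar_range(value: str, minimum: int, maximum: int) -> bool:
    """Scalar check: value parses as an integer within [minimum, maximum]."""
    try:
        number = int(value)
    except ValueError:
        return False
    return minimum <= number <= maximum


def in_enumeration(value: str, options) -> bool:
    """Enumeration check: value is one of the listed options."""
    return value in options


def matches_mask(value: str, mask: str) -> bool:
    """Format check, assuming '#' means one digit and the rest is literal."""
    pattern = "".join(r"\d" if ch == "#" else re.escape(ch) for ch in mask)
    return re.fullmatch(pattern, value) is not None
```

For instance, under the assumed mask rules a timecode such as 01:02:03;04 would satisfy a mask of ##:##:##;##, while 1:02:03;04 would not.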

In SOX, element types may inherit their content models and attribute definitions directly from another named element type. An element type may also inherit and extend an attribute list. Specialization of attribute definitions allows refinement and restriction of attribute datatype, enumeration list and default value. Additionally, an attribute value may be defined to be inherited from the identically named attribute on a parent or older ancestor element. Thus, for example, namespaces can be inherited from superordinate elements.

The SOX namespace facility enables Objects from any identifiable namespace to be used in building a SOX document. That is, any element, attribute, datatype, enumeration, entity, interface, notation, parameter, or processing instruction may be imported from any namespace.

A SOX document is a valid XML document, according to the SOX DTD. The schema designer is free to employ the same XML tools used for traditional XML documents. This means that a SOX document can be processed by a validating XML parser, formatted according to an XSL stylesheet, and managed by any DOM-compliant or SAX-compliant application.


7.1 SOX Example

In this example, the structural elements, video_doc, video_sequence, video_scene, video_shot, video_frame and video_object are declared first. They each possess the DC attributes, plus their own specific elements and attributes.


<schema name="video_doc" namespace="http://www.dstc.edu.au/schemas/video_doc.xml" >

<h1>Video Metadata Document</h1>

<h2>Imported namespaces</h2>

<namespace name="dc" namespace="http://purl.org/metadata/dublin_core#"/>
<namespace name="dcq" namespace="http://purl.org/metadata/dublin_core_qualifiers#"/>

<h2>Structural Elements</h2>

<elementtype name="video_doc">
    <model>
       <sequence>
              <element name="dc_attributes"/>
              <element name="video_sequence" occurs="*"/>
      </sequence>
   </model>
</elementtype>

<elementtype name="video_sequence">
    <model>
       <sequence>
              <element name="seqID"/>
              <element name="dc_attributes"/>
              <element name="time_attributes"/>
              <element name="video_scene" occurs="*"/>
      </sequence>
   </model>
</elementtype>

<elementtype name="video_scene">
    <model>
       <sequence>
              <element name="sceneID"/>
              <element name="dc_attributes"/>
              <element name="time_attributes"/>
              <element name="transcript"/>
              <element name="key_frame"/>
              <element name="video_shot" occurs="*"/>
      </sequence>
   </model>
</elementtype>

<elementtype name="video_shot">
    <model>
       <sequence>
              <element name="shotID"/>
              <element name="dc_attributes"/>
              <element name="time_attributes"/>
              <element name="camera_distance"/>
              <element name="camera_angle"/>
              <element name="camera_motion"/>
              <element name="lighting"/>
              <element name="open_trans"/>
              <element name="close_trans"/>
              <element name="video_frame" occurs="*"/>
      </sequence>
   </model>
</elementtype>

<elementtype name="video_frame">
    <model>
       <sequence>
              <element name="frameID"/>
              <element name="dc_attributes"/>
              <element name="timestamp"/>
              <element name="colour_histogram"/>
              <element name="texture"/>
              <element name="video_object" occurs="*"/>
      </sequence>
   </model>
</elementtype>

<elementtype name="video_object">
    <model>
       <sequence>
              <element name="objectID"/>
              <element name="dc_attributes"/>
              <element name="position"/>
              <element name="shape"/>
              <element name="colour"/>
              <element name="texture"/>
              <element name="anno_text"/>
              <element name="anno_posn"/>
              <element name="video_object" occurs="*"/>
      </sequence>
   </model>
</elementtype>

The next step is to break down the elements to sub-elements and eventually data types. SOX supports both intrinsic basic datatypes as well as user-defined scalar, enumeration and formatted datatypes, derived from the intrinsic datatypes. The code below illustrates some of the capabilities of SOX data typing for video description.


<h2>Attribute Elements</h2>

<elementtype name="dc_attributes">
  <model>
       <sequence>
              <element namespace="dc" name="Title"/>
              <element namespace="dc" name="Creator"/>
              <element namespace="dc" name="Subject"/>
              .....
      </sequence>        
  </model>
</elementtype>

<elementtype name="time_attributes">
  <model>
       <sequence>
              <element name="start_time"/>
              <element name="end_time"/>
              <element name="duration"/>
      </sequence>        
  </model>
</elementtype>

<elementtype name="start_time">
     <instanceof name="time_val"/>
</elementtype>

<elementtype name="end_time">
     <instanceof name="time_val"/>
</elementtype>

<elementtype name="duration">
     <instanceof name="time_val"/>
</elementtype>

<elementtype name="time_val">
  <model>
    <choice occurs="1">
        <element name="frame_num"/>
        <element name="SMPTE"/>
        <element name="abs_time"/>
    </choice>
  </model>
</elementtype>


<elementtype name="frame_num">
  <model>
        <string datatype="frame"/>
  </model>
</elementtype>

<datatype name="frame">
    <scalar datatype="int" min="1" max="25"/>
</datatype>

<elementtype name="SMPTE">
  <model>
    <string>
        <mask>##:##:##;##</mask>
    </string>
  </model>
</elementtype>

<elementtype name="abs_time">
  <model>
        <string datatype="time"/>
  </model>
</elementtype>

<elementtype name="key_frame">
  <model>
        <string datatype="URI"/>
  </model>
</elementtype>

<elementtype name="camera_distance">
    <model>
        <string datatype="camera_distances" />
    </model>
</elementtype>

<datatype name="camera_distances">
    <enumeration datatype="nmtoken">
        <option>close-up</option>
        <option>medium-shot</option>
        <option>long-shot</option>
    </enumeration>
</datatype>

</schema>
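
As a rough illustration of what the data typing above lets a validator check, the following Python sketch restates three of the datatypes (the SMPTE mask, the frame scalar and the camera distance enumeration) as simple predicates. It is not a SOX processor. The function names are hypothetical, and it assumes that each "#" in the mask stands for exactly one decimal digit.

```python
import re

# Assumption: each '#' in the SOX mask ##:##:##;## means one decimal digit.
SMPTE_MASK = re.compile(r"[0-9]{2}:[0-9]{2}:[0-9]{2};[0-9]{2}")

# Values copied from the "frame" and "camera_distances" datatypes above.
FRAME_MIN, FRAME_MAX = 1, 25
CAMERA_DISTANCES = {"close-up", "medium-shot", "long-shot"}


def valid_smpte(value: str) -> bool:
    """True if the whole string matches the SMPTE timecode mask."""
    return SMPTE_MASK.fullmatch(value) is not None


def valid_frame(value: str) -> bool:
    """True if the string is an integer within the declared scalar range."""
    if re.fullmatch(r"[0-9]+", value) is None:
        return False
    return FRAME_MIN <= int(value) <= FRAME_MAX


def valid_camera_distance(value: str) -> bool:
    """True if the string is one of the enumerated options."""
    return value in CAMERA_DISTANCES
```

A value such as "01:02:03;04" passes the mask check, while "1:02:03;04" does not, because the first field has only one digit.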


7.2 Advantages of SOX for Video Metadata


7.3 Disadvantages of SOX for Video Metadata


8. Conclusions: the Ultimate Schema

None of the above schemas is ideal for describing complex multimedia documents. They all satisfy some of the requirements but fall down in other areas. None of them is designed for describing complex hierarchical structures in which there are spatial, temporal, structural and conceptual relationships between the components and where these relationships map to constraints on the relative attribute values of the related components. For example, spatial relationships such as neighbours, in-front-of, behind, overlapping and surrounds correspond to certain constraints on the values of the shape, location or volume attributes of the related objects. Similarly, temporal relationships such as contains, sequential, parallel and overlapping should be mapped to constraints on the start time, end time and duration of the components. None of the schemas supports these capabilities.
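
To make the idea of mapping temporal relationships to attribute constraints concrete, the sketch below expresses four such relationships as predicates over start and end times. The precise definitions (for instance, treating "parallel" as identical start and end times) are assumptions made for illustration; the class and function names are hypothetical and are not taken from any of the schemas discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A component's temporal extent, in seconds (an assumed unit)."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def contains(a: Interval, b: Interval) -> bool:
    """a contains b: b starts and ends within a."""
    return a.start <= b.start and b.end <= a.end


def sequential(a: Interval, b: Interval) -> bool:
    """a is followed by b: a ends no later than b starts."""
    return a.end <= b.start


def parallel(a: Interval, b: Interval) -> bool:
    """a and b run in parallel: same start and end (assumed definition)."""
    return a.start == b.start and a.end == b.end


def overlapping(a: Interval, b: Interval) -> bool:
    """a and b share some stretch of time of non-zero length."""
    return a.start < b.end and b.start < a.end


scene = Interval(0.0, 10.0)
shot1 = Interval(0.0, 4.0)
shot2 = Interval(4.0, 10.0)
```

With these values, the scene contains both shots, and the two shots are sequential rather than overlapping. A schema language able to state such predicates could reject a description in which a shot's end time falls outside its parent scene.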

RDF Schema claims to differ from the other schemas in that it is not a "syntactic" schema but a "semantic" schema. However, the machine-understandable meaning provided by the current version of RDF Schema is very limited, so the advantage of "semantic validation" is negligible. RDF Schema is good at containing and combining descriptors from different namespaces/communities but it has virtually no data typing. Data types must be defined in a separate namespace. This has yet to be done, but the work is intended to be carried out within the recently established W3C XML Schema Working Group [10]. RDF Schema also doesn't easily support multi-layered hierarchical structures because of the inability to specify generic relationship types using properties and to apply these across multiple domain/range pairs. So although RDF is better than the other schemas in its ability to specify relationships other than the implicit child or contains relationship (the only one that the other schemas offer), this facility is limited to a specific range and domain due to the lack of "class-specific constraints".

XML DTDs offer simplicity and fast, cost-effective development due to the ready availability of parsers, tools and applications. However, as a data modelling language, they have limited semantics, which XML-Data, DCD and SOX schemas try to expand by adding things such as strong data typing and lexical constraints.

DCD is an improvement on XML DTDs because it provides better data typing and also provides additional semantics via its RDF conformity. However it doesn't currently support inheritance - although this is a future goal.

SOX has the best data typing. It is also XML compliant, so XML parsers and XML-QL (when it becomes available) will work on it. It supports inheritance, but with attribute extension only, not element extension. It is not RDF-conformant. SOX also provides the best cardinality support: the minimum (other than 0 or 1) and maximum number of children of an element can be specified, e.g. a maximum of 10 shots per scene. DCD and XML DTDs can only specify zero or more, or one or more, children. RDF Schema doesn't support cardinality.
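
The kind of cardinality check described here can be sketched in a few lines of Python using the standard library's ElementTree parser. This only illustrates the constraint itself, not how a SOX processor enforces it. The helper name check_cardinality is hypothetical, and the parent element name video_scene is assumed for the example.

```python
import xml.etree.ElementTree as ET


def check_cardinality(parent, child_tag, minimum=0, maximum=None):
    """True if parent has between minimum and maximum direct children
    named child_tag (maximum=None means unbounded)."""
    count = len(parent.findall(child_tag))
    if count < minimum:
        return False
    if maximum is not None and count > maximum:
        return False
    return True


def make_scene(num_shots):
    """Build a toy scene element with the given number of shots."""
    xml_text = "<video_scene>" + "<video_shot/>" * num_shots + "</video_scene>"
    return ET.fromstring(xml_text)


small_scene = make_scene(3)
large_scene = make_scene(11)
```

Calling check_cardinality(small_scene, "video_shot", minimum=1, maximum=10) accepts the three-shot scene, while the same call on large_scene rejects it because it exceeds the maximum of 10 shots.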

An additional desirable schema feature would be the ability to define equivalence relationships between attributes and to define constraints based on these relationships. For example, suppose there are two attributes, ColourText and ColourHistogram. Then in an ideal schema, users would be able to define an enumerated data type for ColourText (red, yellow, green, blue etc.) and, for each of these possible values, a corresponding permissible range of ColourHistograms. An even more complex example is the mapping of a textual description attribute to some combination of shape, colour and texture attribute value ranges. Such a schema could then be used both to validate the integrity of the input data and to generate metadata automatically where it is not provided. This ability to map from high-level features (such as text) to low-level features (colour, shape, texture) is one of the requirements of the MPEG-7 DDL. It would also greatly improve the searchability of complex multimedia archives. None of the schemas examined provides such sophisticated capabilities.
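
A minimal sketch of the ColourText/ColourHistogram example follows. It assumes, purely for illustration, that a colour histogram is a list of counts over equal-width hue bins covering 0 to 360 degrees and that each colour name maps to a permissible range for the dominant hue. The hue ranges and all names are invented for the example and do not come from MPEG-7 or any of the schemas examined.

```python
# Hypothetical permissible dominant-hue ranges in degrees (lo, hi).
# A range with lo > hi wraps around 0, as for red.
HUE_RANGES = {
    "red": (345, 15),
    "yellow": (45, 75),
    "green": (90, 150),
    "blue": (210, 270),
}


def hue_in_range(hue, lo, hi):
    """True if hue (degrees) lies in [lo, hi], allowing wrap-around."""
    hue %= 360
    if lo <= hi:
        return lo <= hue <= hi
    return hue >= lo or hue <= hi


def dominant_hue(histogram):
    """Centre, in degrees, of the most populated hue bin."""
    bin_width = 360 / len(histogram)
    peak = max(range(len(histogram)), key=histogram.__getitem__)
    return (peak + 0.5) * bin_width


def consistent(colour_text, histogram):
    """Validation: does the histogram fall in the range for the text?"""
    lo, hi = HUE_RANGES[colour_text]
    return hue_in_range(dominant_hue(histogram), lo, hi)


def suggest_colour_text(histogram):
    """Generation: propose a ColourText value when none was supplied."""
    hue = dominant_hue(histogram)
    for name, (lo, hi) in HUE_RANGES.items():
        if hue_in_range(hue, lo, hi):
            return name
    return None


# Twelve 30-degree bins with a peak in the bin centred on 135 degrees.
greenish = [0, 1, 2, 3, 40, 5, 1, 0, 0, 0, 0, 0]
```

The same mapping serves both purposes mentioned above: consistent() validates a supplied ColourText against the histogram, and suggest_colour_text() generates one where it is missing.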

Other relevant schemas not covered in this paper include: the XSchema specification [18] and XML-Data [16]. XSchema is very similar to SOX. Formerly known as XSD, XSchema began as a proposal for the representation of XML DTDs as XML documents. The advantages of using XML document syntax to describe XML document structures include the ability to browse and edit XSchemas using XML-aware tools. This can't be done on DTDs which are not pure XML documents. Although XML-Data has some very useful features, it appears to have been superseded by DCD.


8.1 The Ideal Schema for Multimedia

Based on the above analysis and comparisons, the best solution for video metadata representation is one which provides the object-oriented, semantic concepts of RDF but expresses them within an easily understood, human-readable XML schema. We have proposed such an XML schema for the MPEG-7 DDL [19] which provides the following features:

The two major problems associated with this proposal are that it constitutes yet another schema or language and there are likely to be quite complex extensions necessary to the basic XML parser in order to perform complete validation of all of the constraints.

The W3C XML Schema Working Group [10] is looking at an XML-based schema language which provides support for data typing and structural constraints, currently lacking in XML DTDs. Its charter includes delivering a recommendation on the best combination of DCD, XML-Data, SOX and RDF for validating document syntax. Based on the XML Schema Requirements document [20], there is a very real possibility that the schema which it develops will satisfy the majority of the MPEG-7 DDL requirements.


Acknowledgements

The authors wish to acknowledge that this work was carried out within the Cooperative Research Centre for Research Data Networks established under the Australian Government's Cooperative Research Centre (CRC) Program and acknowledge the support of CITEC and the Distributed Systems Technology CRC under which the work described in this paper is administered.


References

[1] MPEG-7, the "Multimedia Content Description Interface"
[2] Hunter J., Iannella R., "The Application of Metadata Standards to Video Indexing", Second European Conference on Research and Advanced Technology for Digital Libraries, Crete, Greece, September, 1998.
[3] Gonno Y., Nishio F., Haraoka K., Yamagishi Y., "Metadata Structuring of Audiovisual Data Streams on MPEG-2 System", Metastructures '98, Montreal, Canada, August, 1998.
[4] Nishio F., Gonno Y., Haraoka K., Yamagishi Y., "Transporting RDF Metadata Associated with Structured Contents", Metastructures '98, Montreal, Canada, August, 1998.
[5] Bhat D., "On Representing Video Structure Using RDF", Doc ISO/IEC JTC1/SC29/WG11 MPEG98/M4132, MPEG Atlantic City Meeting, October 1998.
[6] Dublin Core Home Page
[7] "MPEG-7 Requirements Document V.7", Doc ISO/IEC JTC1/SC29/WG11 MPEG98/N2461, MPEG Atlantic City Meeting, October 1998.
[8] "Resource Description Framework (RDF) Model and Syntax Specification", REC-rdf-syntax-19990222, W3C Recommendation, 22 February 1999.
[9] "Resource Description Framework (RDF) Schema Specification", WD-rdf-schema-19990218, W3C Working Draft, 18 February 1999.
[10] XML Schema Working Group
[11] Saarela J., SiRPAC - Simple RDF Parser and Compiler, 25 February 1999.
[12] Extensible Markup Language (XML) 1.0, REC-xml-19980210, W3C Recommendation 10 February 1998.
[13] Deutsch A., Fernandez M., Florescu D., Levy A., Suciu D., XML-QL: A Query Language for XML, Submission to W3C, 19 August 1998.
[14] Bray T., "Adding Strong Data Typing to SGML and XML", May 1997.
[15] Document Content Description for XML, Submission to W3C, 31 July 1998.
[16] XML-Data, W3C Note, 5 January 1998.
[17] Schema for Object-Oriented XML (SOX), NOTE-SOX-19980930, Submission to W3C, 15 September 1998.
[18] XSchema Specification, Version 1.0, November 1998.
[19] Hunter J., "A Proposal for an MPEG-7 Description Definition Language", P547, MPEG-7 Test and Evaluation AHG Meeting, Lancaster, February 1999.
[20] "XML Schema Requirements", NOTE-xml-schema-req-19990215, W3C Note, 15 February 1999.


Vitae

Jane Hunter is a Senior Research Scientist within the Resource Discovery Unit at DSTC, investigating international metadata standards and schemas for multimedia resources. She has extensive experience in multimedia indexing, through the development of applications using IBM's Digital Library and the DSTC's SuperNOVA project. She is currently involved in the development of the MPEG-7 Description Definition Language and is also an active participant within the Dublin Core and RDF standards communities. She received a PhD in Computer Animation from the University of Cambridge in 1994.
Liz Armstrong is the Director of the Technology Transfer and Training Unit at the Cooperative Research Centre for Distributed Systems Technology (DSTC Pty Ltd) in Brisbane, Australia. The Unit's activities are centred on the process of transferring technology from the DSTC to the Centre's participant organisations through education, training, special events and secondment programmes. Liz holds a Bachelor of Commerce from Griffith University, with majors in public policy, marketing and video production and she is currently studying for a Masters of Commerce (Information Systems) at the University of Queensland.