Contents
1. HTML and XML - structuring information for the future (21 pp.)
2. Namespaces, XInclude, and XML Base - common extensions to the XML specification (8 pp.)
3. DTD, XML Schema, and DSD - defining language syntax with schemas (27 pp.)
4. XLink, XPointer, and XPath - linking and addressing (26 pp.)
5. XSL and XSLT - stylesheets and document transformation (21 pp.)
6. XQuery - document querying (15 pp.)
7. DOM, SAX, and JDOM - programming for XML (15 pp.)
8. W3C - some background on the World Wide Web Consortium (5 pp.)
Markup Languages: HTML and XML
HTML - original motivation, development, and inherent limitations:
• Hyper-Text Markup Language - the Web today
• Original motivation for HTML - some history
• Compact and human readable - alternative document formats
• From logical to physical structure - requirements from users
• Stylesheets - separating logical structure and layout
• Different versions of HTML - a decade of development
• Syntax and validation - HTML as a formal language
• Browsers are forgiving - the real world
• Structuring general information - not everything is hypertext
• Problems with HTML - why HTML is not the solution
XML as the universal format for structuring information:
• What is XML? - the universal data format
• HTML vs. XML - the key differences
• A conceptual view of XML - XML documents as labeled trees
• A concrete view of XML - XML documents as text with markup
• Applications of XML - an XML language for every domain
• The recipe example - designing a small XML language
• From SGML to XML - a word on doc-heads and development
• SGML relics - things to avoid
• XML technologies - generic languages and tools for free
Selected links:
• Basic XML tools
• Links to more information
Hyper-Text Markup Language
HTML: Hyper-Text Markup Language
What is hyper-text?
• a document that contains links to other documents (and text, sound, images...)
• links may be actuated automatically or on request
• linked documents may replace, be inlined, or create a new window
• most combinations are supported by HTML
What is a markup language?
• a notation for writing text with markup tags
• the tags indicate the structure of the text
• tags have names and attributes
• tags may enclose a part of the text
The start of the HTML for this page, with text, tags, and attributes:
|Hyper-Text Markup Language |
|What is hyper-text? |
| |
|a document that contains links to other documents |
|(and text, sound, images...) |
|links may be actuated automatically or on request |
|linked documents may replace, be inlined, |
|or create a new window |
|most combinations are supported by HTML |
| |
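As a rough illustration of the vocabulary above (tags, names, attributes, enclosed text), the following Python sketch lists the start tags and attributes found in a small HTML fragment. The fragment is made up for this example and is not the actual source of this page; html.parser is part of the Python standard library.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects (tag name, attributes) for every start tag, plus the enclosed text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# Hypothetical fragment, invented for illustration only.
FRAGMENT = (
    '<h1 align="center">Hyper-Text Markup Language</h1>'
    '<ul><li>a <a href="other.html">link</a></li></ul>'
)

lister = TagLister()
lister.feed(FRAGMENT)
lister.close()

for name, attributes in lister.tags:
    print(name, attributes)
```

Each printed line shows one tag name with its attributes; the text enclosed by the tags is collected separately in lister.text.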
Original motivation for HTML
Exchange data on the Internet:
• documents are published by servers
• documents are presented by clients (browsers)
HTML was created by Tim Berners-Lee and Robert Cailliau at CERN in 1991:
• the motivation was to keep track of experimental data
HTML describes only the logical structure of documents:
• browsers are free to interpret markup tags as they please
• the document even makes sense if the tags are ignored
HTML combined well-known ideas:
• hyper-text had been known since 1945
• markup languages date back to 1970
Compact and human readable
Many document formats are very bulky:
• the author controls the precise layout
• all details, including many font tables, must be stored with the contents
In comparison, HTML is slim:
• the author sacrifices control for compactness
• only the actual contents and their logical structure are represented
Sizes of documents containing just the text "Hello World!":
|PostScript |hello.ps |11,274 bytes |
|PDF |hello.pdf |4,915 bytes |
|MS Word |hello.doc |19,456 bytes |
|HTML |hello.html |44 bytes |
Compactness is good for:
• saving space on your server
• lowering network traffic
(Don't worry about voluminous markup - specialized compression techniques are emerging.)
Furthermore, HTML documents can be written and modified with any raw-text editor.
From logical to physical structure
Originally, HTML tags described logical structure:
• h2: "this is a header at level 2"
• em: "this text should be emphasized"
• ul: "this is a list of items"
Quickly, (non-physicist) users wanted more control:
• "this header is centered and written in Times-Roman in size 28pt"
• "this text is italicized"
• "these list items are indented 7mm and use pink elephants for bullets"
The early hack for commercial pages was to make everything a huge image:
|HTML |hello.html |44 bytes |
|GIF |hello.gif |32,700 bytes |
The HTML developers responded with more and more physical layout tags.
Stylesheets
Cascading Style Sheets (CSS):
• specify physical properties (layout) of HTML tags
• are (usually) written in separate files
• can be shared for many HTML documents
There are many advantages:
• logical and physical properties may be separated
• document groups can have consistent looks
• the look can easily be changed
A CSS stylesheet works as follows:
• more than 50 properties can be defined for each kind of tag
• the definitions for a tag may depend on its context
• undefined properties are inherited from enclosing tags
• normal HTML corresponds to default values of properties
Using stylesheets, all tags become logical - however, CSS stylesheets only address superficial properties of documents.
A CSS stylesheet is a collection of selectors and properties:
|B {color:red;} |
|B B {color:blue;} |
|B.foo {color:green;} |
|B B.foo {color:yellow;} |
|B.bar {color:maroon;} |
In the HTML document, the most specific properties are chosen, so:
|Hey! |
|Wow!! |
|Amazing!!! |
|Impressive!!!! |
|k00l!!!!! |
|Fantastic!!!!!! |
| |
gives the result:
|Hey! Wow!! Amazing!!! Impressive!!!! k00l!!!!! Fantastic!!!!!! |
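The rule "the most specific properties are chosen" can be sketched in Python as follows. This is a simplified model, not a CSS engine: it only understands the tag and tag.class selectors with descendant combinators used in the stylesheet above, it approximates specificity as (number of classes, number of tag names), and it lets the later rule win a tie. Elements are described by hypothetical paths of (tag, class) pairs from the outermost element inwards.

```python
RULES = [
    ("B", "red"),
    ("B B", "blue"),
    ("B.foo", "green"),
    ("B B.foo", "yellow"),
    ("B.bar", "maroon"),
]


def parse_selector(text):
    """Turn 'B B.foo' into [('b', None), ('b', 'foo')]."""
    parts = []
    for token in text.split():
        tag, _, cls = token.partition(".")
        parts.append((tag.lower(), cls or None))
    return parts


def compound_matches(compound, node):
    tag, cls = compound
    return tag == node[0] and (cls is None or cls == node[1])


def matches(selector, path):
    """path lists (tag, class) pairs from the outermost ancestor to the element."""
    if not compound_matches(selector[-1], path[-1]):
        return False
    i = len(path) - 2
    # Each remaining part must match some ancestor, searched innermost first.
    for compound in reversed(selector[:-1]):
        while i >= 0 and not compound_matches(compound, path[i]):
            i -= 1
        if i < 0:
            return False
        i -= 1
    return True


def specificity(selector):
    classes = sum(1 for _, cls in selector if cls is not None)
    return (classes, len(selector))


def color_for(path):
    """Return the color of the most specific matching rule, or None."""
    best = None
    for order, (text, color) in enumerate(RULES):
        selector = parse_selector(text)
        if matches(selector, path):
            key = (specificity(selector), order)
            if best is None or key > best[0]:
                best = (key, color)
    return best[1] if best else None


print(color_for([("b", "foo")]))               # a B with class foo
print(color_for([("b", None), ("b", "bar")]))  # a B with class bar inside a B
```

An element matched by no rule (for instance a tag other than B) gets None here; in a browser it would instead inherit the property from its enclosing tag.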
When properly used, the physical layout (a CSS file) is separated from logical structure and the actual contents (an HTML file).
With CSS stylesheets, any tag can be made to look like any other tag.
The default layout in a browser corresponds to a default stylesheet.
Different versions of HTML
HTML has been developed extensively over the years:
1992
HTML is first defined
1993
HTML+ (some physical layout, fill-out forms, tables, math)
1994
HTML 2.0 (standard for core features)
HTML 3.0 (an extension of HTML+ submitted as a draft standard)
1995
Netscape-specific non-standard HTML appears
1996
Competing Netscape and Explorer versions of HTML
HTML 3.2 (standard based on current practices)
1997
HTML 4.0 (separates structure and presentation with stylesheets)
1999
HTML 4.01 (slight modifications only)
2000
XHTML 1.0 (XML version of HTML 4.01)
2001
XHTML 1.1 (modularization to allow different subsets)
2002
XHTML 2.0 (simplifying and generalizing several tags)
Syntax and validation
HTML 4.01 has a precise and formal syntax definition.
• every HTML document should satisfy this definition
• this can be automatically validated
• valid documents get an official seal of approval
• invalid documents get a list of error messages
Browsers are forgiving
Most HTML documents are in fact not valid:
• authors are careless
• documents are "validated" by viewing them in a browser
• autogenerated HTML is often invalid
Even so, most HTML pages look fine:
• the browsers do their best
• no syntax errors are ever reported
|Lousy HTML |Lousy HTML |
|This is not very good. |• This is not very good. |
|In fact, it is quite bad |• In fact, it is quite bad |
| |But the browser does something. |
|But the browser does something. | |
A different approach is HTML Tidy, which corrects (some) errors in HTML documents.
The forgiving behavior of browsers is problematic:
• it promotes bad HTML
• different browsers do different "clever" things
• it is very hard to use invalid documents for anything other than browsing, e.g. for automatic processing by other tools!
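The difference between a forgiving and a strict processor can be seen with the Python standard library. In this sketch the sloppy markup is invented for illustration; html.parser plays the role of a lenient browser, while the XML parser behind xml.etree.ElementTree rejects the same text.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sloppy markup: the two <li> elements are never closed.
LOUSY = (
    "<body><h1>Lousy HTML</h1>"
    "<ul><li>This is not very good.<li>In fact, it is quite bad</ul></body>"
)


class StartTags(HTMLParser):
    """Records every start tag; never complains about missing end tags."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)


def accepted_as_html(text):
    parser = StartTags()
    parser.feed(text)
    parser.close()
    return parser.seen


def accepted_as_xml(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(accepted_as_html(LOUSY))  # the lenient parser reports the tags it saw
print(accepted_as_xml(LOUSY))   # the strict parser refuses the document
```

A tool that relies on the strict parser gets a clear error instead of a silently guessed structure, which is what automatic processing needs.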
Structuring general information
Consider the following recipe collection published in HTML:
|Rhubarb Cobbler |
|Maggie.Herrick@bbs. |
|Wed, 14 Jun 95 |
| |
|Rhubarb Cobbler made with bananas as the main sweetener. |
|It was delicious. Basicly it was |
| |
| |
| 2 1/2 cups diced rhubarb (blanched with boiling water, drain) |
| 2 tablespoons sugar |
| 2 fairly ripe bananas sliced 1/4" round |
| 1/4 teaspoon cinnamon |
| dash of nutmeg |
| |
| |
|Combine all and use as cobbler, pie, or crisp. |
| |
|Related recipes: Garden Quiche |
There are many problems with this approach to using HTML:
• the semantics is encoded into text formatting tags
• there is no means of checking that a recipe is encoded correctly
• it is difficult to change the layout of recipes (CSS is not enough)
It would be much better to invent a special "recipe markup language"...
Problems with HTML
• The language is by design hardwired to describe hypertext:
o there is a fixed collection of tags with a fixed semantics
o but much information just is not hypertext!
• Syntax and semantics are mixed together:
o the structuring of data dictates its presentation in browsers
o stylesheets only provide a weak solution
o different views are not supported
• The standards have been undermined:
o most HTML documents are invalid
o the browsers define sloppy ad-hoc standards
What is XML?
XML: eXtensible Markup Language
XML is a framework for defining markup languages:
• there is no fixed collection of markup tags - we may define our own tags, tailored for our kind of information
• each XML language is targeted at its own application domain, but the languages will share many features
• there is a common set of generic tools for processing documents
XML is not a replacement for HTML:
• HTML should ideally be just another XML language
• in fact, XHTML is just that
• XHTML is a (very popular) XML language for hypertext markup
XML is designed to:
• separate syntax from semantics to provide a common framework for structuring information (browser rendering semantics is completely defined by stylesheets)
• allow tailor-made markup for any imaginable application domain
• support internationalization (Unicode) and platform independence
• be the future of structured information, including databases
HTML vs. XML
Consider the HTML recipe collection again:
|Rhubarb Cobbler |
|Maggie.Herrick@bbs. |
|Wed, 14 Jun 95 |
| |
|Rhubarb Cobbler made with bananas as the main sweetener. |
|It was delicious. Basicly it was |
| |
| |
| 2 1/2 cups diced rhubarb |
| 2 tablespoons sugar |
| 2 fairly ripe bananas |
| 1/4 teaspoon cinnamon |
| dash of nutmeg |
| |
| |
|Combine all and use as cobbler, pie, or crisp. |
| |
|Related recipes: Garden Quiche |
With XML, we can instead define our own "recipe markup language" where the markup tags directly correspond to concepts in the world of recipes:
| |
|Rhubarb Cobbler |
|Maggie.Herrick@bbs. |
|Wed, 14 Jun 95 |
| |
| |
|Rhubarb Cobbler made with bananas as the main sweetener. |
|It was delicious. |
| |
| |
| |
|2 1/2 cups diced rhubarb |
|2 tablespoons sugar |
|2 fairly ripe bananas |
|1/4 teaspoon cinnamon |
|dash of nutmeg |
| |
| |
| |
|Combine all and use as cobbler, pie, or crisp. |
| |
| |
|Garden Quiche |
| |
This example illustrates:
• the markup tags are chosen purely for logical structure
• this is just one choice of markup detail level
• we need to define which XML documents we regard as "recipe collections"
• we need a stylesheet to define browser presentation semantics
• we need to express queries in a general way
Later:
• XML Schema will be used to define our class of recipe documents
• XSLT will be used to transform the XML document into XHTML (or HTML), including automatic construction of index, references, etc.
• XLink, XPointer, and XPath could be used to create cross-references
• XQuery will be used to express queries
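A recipe markup language of the kind described above might look like the document embedded in the following Python sketch. The element names (recipe, title, ingredients, item, amount, type, preparation, related) are assumptions made for this illustration; the point is only that the tags name recipe concepts, so a generic XML parser can extract them without knowing anything about layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: all element names are assumptions for this sketch.
RECIPE = """
<recipe>
  <title>Rhubarb Cobbler</title>
  <date>Wed, 14 Jun 95</date>
  <ingredients>
    <item><amount>2 1/2 cups</amount><type>diced rhubarb</type></item>
    <item><amount>2 tablespoons</amount><type>sugar</type></item>
    <item><amount>2</amount><type>fairly ripe bananas</type></item>
    <item><amount>1/4 teaspoon</amount><type>cinnamon</type></item>
    <item><amount>dash of</amount><type>nutmeg</type></item>
  </ingredients>
  <preparation>Combine all and use as cobbler, pie, or crisp.</preparation>
  <related>Garden Quiche</related>
</recipe>
"""

recipe = ET.fromstring(RECIPE.strip())

# Because amount and type are separate elements, no text scraping is needed.
ingredients = [
    (item.findtext("amount"), item.findtext("type"))
    for item in recipe.iter("item")
]

print(recipe.findtext("title"))
for amount, kind in ingredients:
    print(amount, "|", kind)
```

Splitting each ingredient into an amount and a type is one possible level of markup detail; a coarser or finer design would work the same way.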
A conceptual view of XML
An XML document is an ordered, labeled tree:
• character data leaf nodes contain the actual data (text strings)
o usually, character data nodes must be non-empty and non-adjacent to other character data nodes
• element nodes are each labeled with
o a name (often called the element type), and
o a set of attributes, each consisting of a name and a value,
and these nodes can have child nodes
[Figure: a tree view of the XML recipe collection]
The tree structure of a document can be examined in the Explorer browser.
In addition, XML trees may contain other kinds of leaf nodes:
• processing instructions - annotations for various processors
• comments - as in programming languages
• document type declaration - described later...
Unfortunately, XML is not as simple as it could be, and there is still no agreement on XML tree terminology :-(
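The labeled-tree view can be made concrete with a small Python sketch. It converts a parsed document into nested tuples of the form (name, attributes, children), where children mixes sub-trees and character data strings. The sample document is hypothetical, and the sketch drops whitespace-only text; note also that ElementTree's default parser discards comments and processing instructions, so only element and character data nodes appear.

```python
import xml.etree.ElementTree as ET


def to_tree(element):
    """Return (name, attributes, children) for an element node."""
    children = []
    if element.text and element.text.strip():
        children.append(element.text.strip())
    for child in element:
        children.append(to_tree(child))
        # Text following a child element belongs to the parent's content.
        if child.tail and child.tail.strip():
            children.append(child.tail.strip())
    return (element.tag, dict(element.attrib), children)


# Hypothetical document, invented for illustration.
doc = ET.fromstring(
    '<recipe id="r1"><title>Rhubarb Cobbler</title>'
    "<item>2 tablespoons <type>sugar</type></item></recipe>"
)
tree = to_tree(doc)
print(tree)
```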
A concrete view of XML
An XML document is a (Unicode) text with markup tags and other meta-information.
Markup tags denote elements:
| <foo attr="val" ...> ... </foo> |
• <foo ...> is an element start tag with name foo
• attr="val" is an attribute with name attr and value val; values are enclosed by ' or "
• the text between the tags is the contents of the element
• </foo> is a matching element end tag
There is a short-hand notation for empty elements: <foo attr="val" .../> is equivalent to <foo attr="val" ...></foo>
An XML document must be well-formed:
• start and end tags must match
• element tags must be properly nested
• + some more subtle syntactical requirements
Note: XML is case sensitive!
Special characters can be escaped using Unicode character references:
• &lt; and &#60; both yield <
• &amp; and &#38; both yield &
CDATA Sections are an alternative to escaping many characters:
• <![CDATA[<greeting>Hello, world!</greeting>]]>
The strange syntax is a legacy from SGML...
White-space (blanks, newlines, etc.) is used both for indentation and actual contents. (xml:space attribute provides some control.)
Other meta-information:
• <?target data?> - an instruction for a processor; target identifies the processor to which it is directed, data is a string containing the instruction
• <!-- comment --> - a comment, which will be ignored by all processors
• <!DOCTYPE ...> - a document type declaration (described later...)
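The well-formedness rules, character references, and CDATA sections described above can be tried out with Python's standard XML parser. This is a minimal sketch; the test strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the text parses as an XML document, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Named and numeric references yield the same characters.
escaped = ET.fromstring("<t>&lt; &#60; &amp; &#38;</t>").text

# Inside a CDATA section, markup characters are plain character data.
cdata = ET.fromstring(
    "<t><![CDATA[<greeting>Hello, world!</greeting>]]></t>"
).text

print(escaped)
print(cdata)
print(is_well_formed("<foo attr='val'>text</foo>"))  # matching tags
print(is_well_formed("<foo></Foo>"))                 # XML is case sensitive
print(is_well_formed("<a><b></a></b>"))              # improper nesting
```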
Applications of XML
There are already hundreds of serious applications of XML.
XHTML
W3C's XMLization of HTML 4.0. Example XHTML document:
| |
| |
| Hello world! |
| foobar |
| |
CML
Chemical Markup Language. Example CML document snippet:
| |
| |
| C O H H H H |
| |
|-0.748 0.558 -1.293 -1.263 -0.699 0.716 |
| |
| |
| |
WML
Wireless Markup Language for WAP services:
| |
| |
| |
| |
|Hello World |
| |
| |
| |
ThML
Theological Markup Language:
| Having a Humble Opinion of Self |
|EVERY man naturally desires knowledge |
| |
| |
|Aristotle, Metaphysics, i. 1. |
| |
|; |
|but what good is knowledge without fear of God? Indeed a humble |
|rustic who serves God is better than a proud intellectual who |
|neglects his soul to study the course of the stars. |
| |
| |
|Augustine, Confessions V. 4. |
| |
| |
| |
There is a long list of many other XML applications.
The recipe example
Consider again recipes, such as in this example (raw text file).
We design an XML version of a recipe collection:
• a recipe consists of ingredients, steps for preparation, possibly some comments, and a specification of its nutrition
• an ingredient can be simple or composite
• a simple ingredient has a name, an amount (possibly unspecified), and a unit (unless the amount is dimensionless)
• a composite ingredient is recursively a recipe
This example (formatted XML file) contains five recipes. Abbreviated version:
| |
| |
| |
|Some recipes used for the XML tutorial. |
| |
| |
|Beef Parmesan with Garlic Angel Hair Pasta |
| |
|... |
| |
| |
|Preheat oven to 350 degrees F (175 degrees C). |
| |
|... |
| |
| |
|Make the meat ahead of time, and refrigerate over night, the acid in the |
|tomato sauce will tenderize the meat even more. If you do this, save the |
|mozzarella till the last minute. |
| |
| |
| |
|... |
| |
XML documents (usually) begin with an XML declaration (<?xml ...?>).
SGML relics
- only a fool does not fear "external general parsed entities"
As an unfortunate heritage from SGML, the header of an XML document may contain a document type declaration:
| |
| |
| |
|]> |
| &hi; world! |
This part can contain:
• DTD (Document Type Definition) information:
o element type declarations (ELEMENT)
o attribute-list declarations (ATTLIST)
(described later...)
• entity declarations (ENTITY) - a simple macro mechanism
• notation declarations (NOTATION) - data format specifications
Avoid all these features whenever possible!
Unfortunately, they cannot always be ignored - all XML processors (even non-validating ones) are required to:
• normalize attribute values (prune white-space etc.)
• handle internal entity references (e.g. expand &hi; in greeting)
• insert default attribute values (e.g. insert style="small" in greeting)
according to the document type declaration, if one is present.
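That even a non-validating processor acts on the document type declaration can be observed with Python's ElementTree, whose parser is built on Expat. The document in this sketch is hypothetical: the entity value "Hello" and the exact declarations are assumptions chosen to match the greeting example discussed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the declarations are assumptions for this sketch.
DOC = """<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
  <!ATTLIST greeting style (big|small) "small">
  <!ENTITY hi "Hello">
]>
<greeting>&hi; world!</greeting>"""

greeting = ET.fromstring(DOC)

# The internal entity reference &hi; has been expanded ...
print(greeting.text)
# ... and the default value of the style attribute has been inserted,
# although the start tag in the document carries no attributes.
print(greeting.attrib)
```

An application that reads only the element tree cannot tell whether style="small" was written in the document or supplied by the declaration, which is one reason to avoid these features.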
XML technologies
XML is:
• hot ($$$)
• the standard for representation of Web information
• by itself, just a notation for hierarchically structured text
But a notation for tree structures is not enough:
• the real force of XML is generic languages and tools!
• by building on XML, you get a massive infrastructure for free
The XML vision offers:
• common extensions to the core XML specification
a namespace mechanism, document inclusion, etc.
• schemas
grammars to define classes of documents
• linking between documents
a generalization of HTML anchors and links
• addressing parts of read-only documents
flexible and robust pointers into documents
• transformation
conversion from one document class to another
• querying
extraction of information, generalizing relational databases
To "use XML":
1. define your XML language (use e.g. XML Schema to define its syntax)
2. exploit the generic XML tools (e.g. XSLT and XQuery processors), the generic protocols, and the generic programming frameworks (e.g. DOM or SAX) to build application tools
These technologies are described in the following sections.
Other related technologies (not covered here):
• XML Information Set
attempt to define common terminology for XML document concepts
("information set"=tree, "information item"=node, ...)
• XML-Signature
digital signatures of Web resources
• XML Encryption
encryption of Web resources
• XML Fragment Interchange
for dealing with fragments of XML documents
• XML Protocol and SOAP (Simple Object Access Protocol)
information exchange protocol
• XForms
a common sublanguage for input forms (with XHTML forms as a special case)
• RDF (Resource Description Framework)
a framework for metadata (statements about properties and relationships)
Basic XML tools
Parsers
• XML4J / Xerces (alphaworks.tech/xml4j)
From alphaWorks, in Java, supports DOM and SAX
• Expat (expat.)
Written in C (ported to other languages), used by LIBWWW, Apache, Netscape, ...
• + 1000 others...
Editors
• Xeena (alphaWorks.tech/xeena)
From alphaWorks, in Java, with tree-view syntax directed editing
• XMLSpy ()
Popular, but not free :-(
• + 1000 others...
Servers and Browsers
• Apache XML (xml.)
built in Xerces XML parser, Xalan XSLT processor, ...
• Netscape Navigator 6 and Internet Explorer 5
XML parsing and validation, rendering with XSL and CSS, script access via DOM, ...
• Amaya (Amaya)
W3C's editor/browser
Links to more information
TR/REC-xml.html
the XML 1.0 specification
TR/xml11
the XML 1.1 draft specification, minor changes to reflect Unicode revisions
XML
W3C's XML homepage
XML information by O'Reilly: articles, software, tutorials
cover
The XML Cover Pages: comprehensive online reference
: concise XML news
news:comp.text.xml
XML newsgroup
ucc.ie/xml
XML FAQ
axml/testaxml.htm
the Annotated XML Specification, by Tim Bray
metalab.unc.edu/xml
Cafe con Leche XML News and Resources
inf2.pira.co.uk/top011a.htm
El.pub's markup language section
wdvl.Authoring/Languages/XML
links to XML information
xml
XML School: an XML tutorial
garshol.priv.no/download/xmltools
a list of free XML tools
Namespaces, XInclude, and XML Base
- common extensions to the core XML specification
Namespaces - mixing XML languages
• Mixing XML languages - name clashes
• Qualifying names - solving the problem with URIs
• Namespace declarations - declarations and prefixes
XInclude - combining XML documents
• Combining XML documents - reuse and modularity
• An XInclude example - an example
• XInclude details - more details
XML Base - resolving relative URIs
• XML Base - another common XML extension
Selected links:
• Links to more information
Mixing XML languages
Consider an XML language WidgetML which uses XHTML as a sublanguage for help messages:
| |
| |
| |
| |
| |
| Description of gadget |
| |
| |
| Gadget |
| A gadget contains a big gizmo |
| |
| |
| |
A problem: the meaning of head and big depends on the context!
This complicates things for processors and might even cause ambiguities.
The root of the problem is: one common name space.
Qualifying names
Simple solution: qualify names with URIs (Uniform Resource Identifiers)
A qualified name thus has two parts: the qualifying URI, followed by the local name.
Do not be confused by the use of URIs for namespaces:
• they are not supposed to point to anything
• it is simply the cheapest way of getting unique names
• we rely on existing organizations that control domain names
(just like Java package names!)
This is the idea - the actual solution is less verbose but slightly more complicated...
Namespace declarations
Namespaces are declared by special attributes and associated prefixes:
| |
| ... |
| ... |
| ... |
| |
xmlns:prefix="URI" declares a namespace with a prefix and a URI:
• the scope of a declaration is lexical: it covers the element containing the declaration and all its descendants, and it can be overridden by nested declarations
• both element and attribute names can be qualified with namespaces
• the name of the prefix is irrelevant - applications should use only the URI
For backward compatibility and simplicity, unprefixed element names are assigned a default namespace:
• declaration: xmlns="URI"
• default value: "" (means: treat as unqualified name)
• does not affect unprefixed attribute names (they belong to the containing elements)
WidgetML with namespaces:
| |
| |
| |
| |
| |
| Description of gadget |
| |
| |
| Gadget |
| A gadget contains a big gizmo |
| |
| |
| |
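How a namespace-aware processor sees such a document can be sketched with Python's ElementTree, which reports every element name as {URI}local-name. The WidgetML namespace URI below is hypothetical, as is the abbreviated document; the XHTML URI is the XHTML 1.0 namespace.

```python
import xml.etree.ElementTree as ET

WIDGET_NS = "http://www.example.org/widgetml"  # hypothetical URI for WidgetML
XHTML_NS = "http://www.w3.org/1999/xhtml"      # the XHTML 1.0 namespace


def widget_doc(prefix):
    """Build a small mixed document using the given prefix for XHTML names."""
    return (
        f'<widget xmlns="{WIDGET_NS}" xmlns:{prefix}="{XHTML_NS}">'
        f'<head size="medium"/>'
        f"<{prefix}:head><{prefix}:title>Description of gadget"
        f"</{prefix}:title></{prefix}:head>"
        f"</widget>"
    )


def qualified_names(text):
    """Element names in document order, in {URI}local-name form."""
    return [element.tag for element in ET.fromstring(text).iter()]


names = qualified_names(widget_doc("xhtml"))
for name in names:
    print(name)

# The two head elements are now distinguishable, and renaming the prefix
# changes nothing as far as this processor is concerned.
print(names == qualified_names(widget_doc("h")))

# The unprefixed attribute is not placed in the default namespace.
widget_head = ET.fromstring(widget_doc("xhtml")).find(f"{{{WIDGET_NS}}}head")
print(widget_head.attrib)
```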
How should a relative URI be interpreted?
• relative to the base URI?
• relative to the document URI?
• just as a string?
This innocent question spawned a controversy that resulted in leaving the matter undefined (by deprecating such namespaces).
Other controversies:
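• does the choice of prefix matter, or are two names with different prefixes bound to the same URI the same?
• is an unprefixed attribute the same as that attribute qualified with the namespace of its element?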
(Unfortunately, according to the spec, the choice of prefix may matter, and an unqualified attribute generally does not just inherit the namespace from its element.)
Combining XML documents
To enhance reuse and modularity, a technique for constructing new XML documents from existing ones is desirable.
XInclude provides a simple inclusion mechanism.
Why yet another specification?
• many XML documents and languages can benefit from modularity
• as for the namespace solution, a generic approach can be implemented in generic tools
Application conformance: Think of XML as if Namespaces, XInclude, and XML Base were parts of the basic XML specification. (Caveat: the latter two are not widely implemented yet.)
An XInclude example
A document containing:
| |
| |
| |
where somewhere.xml contains:
|... |
is equivalent to:
| |
| ... |
| |
• is the official XInclude namespace
• the include element name in that namespace is an inclusion directive
• right after parsing and before other processing, an XInclude processor performs the inclusion (tree substitution)
• the original and the resulting document should be considered equivalent
• it is an error to have cyclic includes
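The inclusion step can be sketched with the xml.etree.ElementInclude module from the Python standard library, which implements a subset of XInclude. To keep the example self-contained, the included resource is served from an in-memory table by a custom loader instead of being fetched; the element names foo and bar and the content of somewhere.xml are placeholders.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Stand-in for resources that would normally be fetched by URI.
RESOURCES = {"somewhere.xml": "<bar>...</bar>"}


def loader(href, parse, encoding=None):
    """Return an Element for parse="xml", or the raw text for parse="text"."""
    text = RESOURCES[href]
    return ET.fromstring(text) if parse == "xml" else text


root = ET.fromstring(
    '<foo xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="somewhere.xml"/>'
    "</foo>"
)

# Performs the tree substitution in place, right after parsing.
ElementInclude.include(root, loader=loader)

print([child.tag for child in root])
```

After the call, the inclusion directive is gone and the root element contains the tree of the included document.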
XInclude details
How is the included resource denoted?
• with XPointer (described later...) - an extension of URLs that can address document nodes, node sets, or character data ranges
Other issues:
• with parse="text" and encoding="..." attributes, a resource can be transformed into a character data node before inclusion
• XInclude processors may need to create namespace declaration attributes to ensure equivalence
Many XInclude processors support only whole-document URIs, not full XPointer.
XML Base
A URI identifies a resource:
• is an absolute URI
• somefile.xml is a relative URI
Inspired by the <base> mechanism in HTML, XML Base provides a uniform way of resolving relative URIs.
In the following example:
| |
| |
| |
the value of href attribute can be interpreted as the absolute URI .
• the xml namespace prefix is hardwired by the Namespace specification
• xml:base has lexical scope (as namespace declarations)
• the URI used to access the document is used as default URI base
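The resolution rules above can be sketched in Python with urllib.parse.urljoin. The function walks the tree, updates the base URI whenever an xml:base attribute is in scope, and resolves every href attribute against it. All URIs and element names in the example are hypothetical.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# The xml prefix is bound to this fixed namespace URI.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def resolve_hrefs(element, base):
    """Yield absolute URIs for all href attributes, honoring xml:base lexically."""
    # An xml:base value may itself be relative to the enclosing base.
    base = urljoin(base, element.get(XML_BASE, ""))
    if "href" in element.attrib:
        yield urljoin(base, element.get("href"))
    for child in element:
        yield from resolve_hrefs(child, base)


# Hypothetical document and document URI.
DOCUMENT_URI = "http://www.example.org/dir/index.xml"
doc = ET.fromstring(
    '<doc><link href="somefile.xml"/>'
    '<part xml:base="http://www.example.org/today/"><link href="new.xml"/>'
    '<sub xml:base="/hot/"><link href="pick1.xml"/></sub></part></doc>'
)

resolved = list(resolve_hrefs(doc, DOCUMENT_URI))
for uri in resolved:
    print(uri)
```

The first link has no xml:base in scope, so the URI used to access the document serves as the base.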
Examples of applications:
• XLink (requires XML Base support)
• XHTML (will use XML Base)
• Namespaces (does not conform to XML Base, but it ought to...)
• your future XML language
Future XML parsers will support Namespaces, XInclude, and XML Base.
Links to more information
Namespaces:
TR/REC-xml-names
the W3C XML Namespace Recommendation
xml/xmlns.htm
an explanation of the recommendation by James Clark
xml/pub/1999/01/namespaces.html
an article on Namespaces
xml/NamespacesFAQ.htm
comprehensive Namespace FAQ
XInclude:
TR/xinclude
XInclude, W3C Candidate Recommendation
xml/XInclude
a Java XInclude processor
XML Base:
TR/xmlbase
the W3C XML Base Recommendation
DTD, XML Schema, and DSD
- defining language syntax with schemas
Overview:
• Schemas and schema languages - defining the syntax of your own XML language
• Choosing a schema language - lots of alternatives
DTD - the insufficient schema language defined in the XML 1.0 spec:
• DTD - Document Type Definition - an overview
• Example DTD - the recipe example
• Problems with DTD - top 15 reasons for not using DTD
XML Schema - W3C's recent proposal:
• Design requirements - how to design a schema language in W3C
• XML Schema - the design
• A small example - the business-card example
• Overview - the central constructs and ideas
• Constructing complex types - requirements for attribute and content presence
• Constructing simple types - requirements for attribute values and character data
• Local definitions - inlined declarations, anonymous types, and overloading
• Inheritance and substitution groups - the type system
• Annotations - self-documentation
• Schema inclusion and redefinition - modularity and reuse
• Namespaces - constraining the use of namespaces
• Attribute and element defaults - side-effects of validation
• Identity constraints - uniqueness and keys
• A larger example - the recipe example
• Problems with XML Schema - 15 reasons why we haven't seen the last schema language
DSD - the next generation of schema languages:
• Document Structure Description 2.0 - central aspects
• Example - the recipe example
• Rules - describing elements
• Boolean expressions - expressing element properties
• Regular expressions - describing attribute values and chardata
• Inclusion and extension - modular descriptions
Selected links:
• Links to more information
Schemas and schema languages
A schema is a definition of the syntax of an XML-based language (i.e. a class of XML documents).
A schema language is a formal language for expressing schemas.
Schema processing: Given an XML document and a schema, a schema processor
• checks for validity, i.e. that the document conforms to the schema requirements
• if the document is valid, a normalized version is output: default attributes and elements are inserted, parsing information may be added, etc.
The document being validated is called an instance document or application document.
Why bother formalizing the syntax with a schema?
• a formal definition provides a precise but human-readable reference
• schema processing can be done with existing implementations
• your own tools for your language can benefit: by piping input documents through a schema processor, you can assume that the input is valid and defaults have been inserted
Schemas are similar to grammars for programming languages; however, context-free grammars are not expressive enough for XML.
The term "schema" comes from the database community.
Choosing a schema language
There have been many schema language proposals.
W3C proposals:
• DTD
• XML-Data, January 1998
• DCD (Document Content Description), July 1998
• DDML (Document Definition Markup Language), January 1999
• SOX (Schema for Object-oriented XML), July 1999
• XML Schema
Non-W3C proposals:
• Assertion Grammars by Dave Raggett
• Schematron by Rick Jelliffe
• TREX (Tree Regular Expressions for XML) by James Clark
• Examplotron by Eric van der Vlist
• RELAX by Makoto Murata / RELAX NG by Murata and Clark
• DSD (Document Structure Description)
Unlike for many other XML technologies, it has proved difficult to reach a consensus - probably because:
• it is an inherently difficult problem
• people have different needs from a schema language
• the official (W3C) proposals are not very good
However, most schema languages have many similarities.
We shall look at W3C's DTD and XML Schema proposals and at the DSD proposal developed by BRICS and AT&T.
DTD - Document Type Definition
Recall from earlier that XML 1.0 contains a built-in schema language: Document Type Definition
• <!DOCTYPE root-element [ ... ]>
determines the name of the root element and contains the document type declarations
• <!ELEMENT element-name content-model>
associates a content model to all elements of the given name
content models:
o EMPTY: no content is allowed
o ANY: any content is allowed
o (#PCDATA|element-name|...): "mixed content", arbitrary sequence of character data and listed elements
o deterministic regular expression over element names: sequence of elements matching the expression
▪ choice: (...|...|...)
▪ sequence: (...,...,...)
▪ optional: ...?
▪ zero or more: ...*
▪ one or more: ...+
• <!ATTLIST element-name attribute-name attribute-type attribute-default ...>
declares which attributes are allowed or required in which elements
attribute types:
o CDATA: any value is allowed (the default)
o (value|...): enumeration of allowed values
o ID, IDREF, IDREFS: ID attribute values must be unique (contain "element identity"), IDREF attribute values must match some ID (reference to an element)
o ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION: just forget these... (consider them deprecated)
attribute defaults:
o #REQUIRED: the attribute must be explicitly provided
o #IMPLIED: attribute is optional, no default provided
o "value": if not explicitly provided, this value inserted by default
o #FIXED "value": as above, but only this value is allowed
This is a simple subset of SGML DTD.
Validity can be checked by a simple top-down traversal of the XML document (followed by a check of IDREF requirements).
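The top-down validity check for content models can be sketched in Python by translating each content model into a regular expression over child element names. This is a simplified illustration, not a DTD validator: attributes and ID/IDREF are ignored, character data inside element content is not checked, and the element names and content models in the table are assumptions for the recipe language.

```python
import re
import xml.etree.ElementTree as ET

NAME = r"[A-Za-z_][\w.\-]*"

# Hypothetical content models, written in DTD notation.
MODELS = {
    "recipe": "(title, ingredient*, preparation, comment?, nutrition)",
    "title": "(#PCDATA)",
    "ingredient": "EMPTY",
    "preparation": "(step*)",
    "step": "(#PCDATA)",
    "comment": "(#PCDATA)",
    "nutrition": "EMPTY",
}


def children_ok(element, model):
    """Check the children of one element against its content model."""
    model = "".join(model.split())
    if model == "ANY":
        return True
    if model == "EMPTY":
        # Simplification: whitespace-only text is tolerated here.
        return len(element) == 0 and not (element.text or "").strip()
    if model.startswith("(#PCDATA"):
        allowed = set(re.findall(NAME, model)) - {"PCDATA"}
        return all(child.tag in allowed for child in element)
    # Element content: each name becomes a group matching "name;",
    # ',' (sequence) disappears, and | ? * + keep their regexp meaning.
    pattern = re.sub(
        NAME, lambda m: "(?:" + re.escape(m.group(0)) + ";)", model
    ).replace(",", "")
    sequence = "".join(child.tag + ";" for child in element)
    return re.fullmatch(pattern, sequence) is not None


def validate(element, models):
    """Top-down traversal: every element must be declared and match its model."""
    model = models.get(element.tag)
    if model is None or not children_ok(element, model):
        return False
    return all(validate(child, models) for child in element)


good = ET.fromstring(
    "<recipe><title>Cobbler</title><ingredient/><ingredient/>"
    "<preparation><step>Mix.</step></preparation><nutrition/></recipe>"
)
bad = ET.fromstring(
    "<recipe><ingredient/><title>Cobbler</title>"
    "<preparation/><nutrition/></recipe>"
)
print(validate(good, MODELS), validate(bad, MODELS))
```

The second document fails because title must come before the ingredient elements in the assumed content model for recipe.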
Example DTD
A DTD for our recipe collections, recipes.dtd:
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
By inserting:
| |
in the headers of recipe collection documents, we state that they are intended to conform to recipes.dtd.
Alternatively, the DTD can be given locally with <!DOCTYPE root-element [ ... ]>.
This grammatical description has some obvious shortcomings:
• unit should only be allowed when amount is present
• the comment element should be allowed to appear anywhere
• nested ingredient elements should only be allowed when amount is absent
Problems with DTD
Top 15 reasons for avoiding DTD:
1. not itself using XML syntax (the SGML heritage can be very unintuitive + if using XML, DTDs could potentially themselves be syntax checked with a "meta DTD")
2. mixed into the XML 1.0 spec (would be much less confusing if specified separately + even non-validating processors must look at the DTD)
3. no constraints on character data (if character data is allowed, any character data is allowed)
4. too simple attribute value models (enumerations are clearly insufficient)
5. cannot mix character data and regexp content models (and the content models are generally hard to use for complex requirements)
6. no support for Namespaces (of course, XML 1.0 was defined before Namespaces)
7. very limited support for modularity and reuse (the entity mechanism is too low-level)
8. no support for schema evolution, extension, or inheritance of declarations (difficult to write, maintain, and read large DTDs, and to define families of related schemas)
9. limited white-space control (xml:space is rarely used)
10. no embedded, structured self-documentation (<!-- comments --> are not enough)
11. content and attribute declarations cannot depend on attributes or element context (many XML languages use that, but their DTDs have to "allow too much")
12. too simple ID attribute mechanism (no points-to requirements, uniqueness scope, etc.)
13. only defaults for attributes, not for elements (but that would often be convenient)
14. cannot specify "any element" or "any attribute" (useful for partial specifications and during schema development)
15. defaults cannot be specified separate from the declarations (would be convenient to have defaults in separate modules)
Design requirements
Quotes from the W3C Note "XML Schema Requirements" (Feb. 1999):
Design principles:
The XML schema language shall be
1. more expressive than XML DTDs
2. expressed in XML
3. self-describing
4. usable by a wide variety of applications that employ XML
5. straightforwardly usable on the Internet
6. optimized for interoperability
7. simple enough to be implemented with modest design and runtime resources
8. coordinated with relevant W3C specs
The XML schema language specification shall
1. be prepared quickly
2. be precise, concise, human-readable, and illustrated with examples
Structural requirements:
The XML schema language must define
1. mechanisms for constraining document structure (namespaces, elements, attributes) and content (datatypes, entities, notations)
2. mechanisms to enable inheritance for element, attribute, and datatype definitions
3. mechanism for URI reference to standard semantic understanding of a construct
4. mechanism for embedded documentation
5. mechanism for application-specific constraints and descriptions
6. mechanisms for addressing the evolution of schemata
7. mechanisms to enable integration of structural schemas with primitive data types
Unfortunately, the W3C's own XML Schema Recommendation does not fulfil all of these requirements (self-describing, simple, concise, human-readable, ...).
XML Schema
W3C Recommendation, May 2001.
Consists of two parts:
1. Structures
2. Datatypes
Main features:
• XML syntax (there is a Schema for Schemas)
• uses and supports Namespaces
• object-oriented-like type system for declarations (with inheritance, subsumption, abstract types, and finals)
• global (=top-level) and local (=inlined) type definitions
• modularization (schema inclusion and redefinitions)
• structured self-documentation
• cardinality constraints for sub-elements
• nil values (missing content)
• attribute and element defaults
• any-element, any-attribute
• uniqueness constraints and ID/IDREF attribute scope
• regular expressions for specifying valid chardata and attribute values
• lots of built-in data types for chardata and attribute values
Yes, it is big and complicated! (Part 1 of the spec alone is around 200 pages...)
A small example
Assume we want to create an XML-based language for business cards.
An example document john_doe.xml:
| |
|John Doe |
|CEO, Widget Inc. |
|john.doe@ |
|(202) 456-1414 |
| |
| |
To describe the syntax of our new language, we write a schema business_card.xsd:
| |
The XML Schema language is recognized by the namespace http://www.w3.org/2001/XMLSchema.
A document may refer to a schema with the schemaLocation (or the noNamespaceSchemaLocation) attribute:
| |
|... |
| |
By inserting this, the author claims that the document is intended to be valid with respect to the schema (not that it necessarily is valid).
Overview of XML Schema
The most central top-level constructs:
• a (global) element declaration associates an element name with a type
• a complex type definition defines requirements for attributes, sub-elements, and character data in elements of that type
o attribute declarations: describe which attributes may or must appear
o element references: describe which sub-elements may or must appear, how many, and in which order
• a simple type definition defines a set of strings to be used as attribute values or character data
An element in an XML document is valid according to a given schema if the associated element type rules are satisfied.
If all elements are valid, the whole document is called valid. (Unlike DTD, there is no way to require a specific root element.)
Naming conflicts: two types or two elements cannot be defined with the same name, but an element declaration and a type definition may use the same name.
Constructing complex types
A complexType can contain:
• attribute declarations: <attribute name="..." type="..." use="..."/>
where type refers to a simple type definition and use is either required, optional, or prohibited
• one of the following content model kinds:
o empty content (the default)
o simple content: <simpleContent> ... </simpleContent> (only character data is allowed)
o regexp content: a (restricted) combination of
▪ <sequence> ... </sequence>
▪ <choice> ... </choice>
▪ <all> ... </all>
containing element references of the form <element ref="..." minOccurs="..." maxOccurs="..."/>
where ref refers to an element definition, and minOccurs and maxOccurs constrain the number of occurrences
(if complexType has the attribute mixed="true", arbitrary character data is also allowed)
Example:
| |
Grouping of definitions:
Attribute groups: groups of attribute declarations can be defined with <attributeGroup name="..."> ... </attributeGroup> and used with <attributeGroup ref="..."/>.
Element groups: similarly, groups of regexp content model descriptions can be defined and used with the group construct.
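To make the role of minOccurs and maxOccurs in a sequence content model concrete, here is a small Python sketch. It is an illustration only, not how a schema processor is implemented: the function name matches_sequence and the element names are hypothetical, each particle is a plain element reference, and the greedy matching is only correct when neighbouring particles refer to different element names.

```python
import xml.etree.ElementTree as ET


def matches_sequence(element, particles):
    """Check the children of `element` against a sequence of particles.

    Each particle is (element name, minOccurs, maxOccurs), where
    maxOccurs=None stands for "unbounded".
    """
    children = [child.tag for child in element]
    pos = 0
    for name, min_occurs, max_occurs in particles:
        count = 0
        # Greedily consume as many matching children as maxOccurs permits.
        while (pos < len(children) and children[pos] == name
               and (max_occurs is None or count < max_occurs)):
            pos += 1
            count += 1
        if count < min_occurs:
            return False
    # Every child must have been consumed by some particle.
    return pos == len(children)


# Hypothetical content model: one engine followed by 3 or 4 wheels.
CAR = [("engine", 1, 1), ("wheel", 3, 4)]
```

For example, matches_sequence(ET.fromstring("<car><engine/><wheel/><wheel/><wheel/></car>"), CAR) holds, while a car with two or five wheel elements is rejected.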
Constructing simple types
Simple types can be:
• primitive (hardwired meaning)
• derived from existing simple types:
o by a list: white-space separated sequence of other simple types
o by a union: union of other simple types
o by a restriction:
▪ length, minLength, maxLength (string or list lengths)
▪ enumeration (intersection with list of values)
▪ pattern (intersection with Perl-like regexp)
▪ whiteSpace (preserve/replace/collapse white-space)
▪ minInclusive, maxInclusive (bounds on numbers)
A lot of often-used simple types (all the primitive and some derived) are predefined:
• integer
• date
• anyURI
• unsignedLong
• language
• ...
Example definition of a derived simple type:
| |
All this is specified in Part 2 of the spec.
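Derivation by restriction can be read as intersecting the value set of a base type with extra conditions (facets). The following Python sketch models that reading. It is an assumption-laden illustration: is_decimal and restrict are hypothetical helpers, only three facets are covered, Python's re syntax stands in for the schema regexp dialect, and min_inclusive presupposes a numeric base type.

```python
import re
from decimal import Decimal


def is_decimal(text):
    """Rough stand-in for the lexical space of the built-in decimal type."""
    return re.fullmatch(r"[+-]?(\d+(\.\d*)?|\.\d+)", text.strip()) is not None


def restrict(base, *, pattern=None, enumeration=None, min_inclusive=None):
    """Derive a new checker from `base` by intersecting it with facets."""
    def check(text):
        if not base(text):
            return False
        # A pattern facet must match the whole value, hence fullmatch.
        if pattern is not None and re.fullmatch(pattern, text) is None:
            return False
        if enumeration is not None and text not in enumeration:
            return False
        if min_inclusive is not None and Decimal(text.strip()) < min_inclusive:
            return False
        return True
    return check


def any_string(text):
    """Stand-in for the built-in string type: every value is accepted."""
    return True


# Hypothetical derived types.
non_negative_decimal = restrict(is_decimal, min_inclusive=Decimal(0))
five_digits = restrict(any_string, pattern=r"\d{5}")
size = restrict(any_string, enumeration={"small", "large"})
```

Because each derived checker first delegates to its base, restrictions can be stacked, e.g. restrict(non_negative_decimal, pattern=r"\d+") accepts only non-negative whole numbers.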
Local definitions
Instead of writing all element declarations and type definitions at top-level (globally), they may be inlined (locally):
Example:
| |
means the same as
| |
(where the complex type card_type and the description of name have been inlined)
except that:
• inlined type definitions are anonymous, so they cannot be referred to for reuse
• inlined element declarations can be overloaded, i.e. they need not have unique names
- otherwise, it is just a matter of authoring style.
Inheritance and substitution groups
XML Schema contains an incredibly complicated type system.
As in many programming languages, XML Schema allows (complex) types to be declared as sub-types of existing types.
• inheritance by extension:
| |
creates a car type from a vehicle type by extending it with 3 or 4 wheel sub-elements
• inheritance by restriction:
| |
creates a small_car type from the car type by restricting it to 3 wheel sub-elements
Subsumption:
Assume that we declare an element:
|<element name="myVehicle" type="n:vehicle"/> |
meaning that myVehicle elements are valid if they match the vehicle type.
Since car is a sub-type of vehicle, myVehicle elements are also valid if they match car - provided that we add xsi:type="n:car" to the elements.
(xsi refers to http://www.w3.org/2001/XMLSchema-instance)
Substitution groups: another (simpler and better) way of achieving basically the same thing
If we declare another element as follows:
|<element name="myCar" type="n:car" substitutionGroup="n:myVehicle"/> |
then we may always use myCar elements whenever myVehicle elements are required (without using xsi:type).
Note that this does not bypass the type hierarchy: the type of myCar must be the same as, or validly derived from, the type of myVehicle.
Abstract and final:
In addition to all this,
• inheritance of types can be forbidden (by declaring them as final)
• use of elements and types can be forbidden (declared abstract)
Annotations
Schemas can be annotated with human or machine readable documentation and other information:
| |
| |
| |
|the author of the recipe, |
|see this list of authors |
| |
| |
| |
| |
| |
|... |
| |
Note that annotations can be structured, as opposed to simple XML comments.
Schema inclusion and redefinition
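No fewer than three mechanisms are available:
• <include schemaLocation="..."/> - compose with schema having same target namespace
• <import namespace="..." schemaLocation="..."/> - compose with schema having different target namespace
• <redefine schemaLocation="..."> ... </redefine> - compose with schema having same target namespace, allowing redefinitions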
It ought to also be possible to use XInclude, but that is not mentioned in the XML Schema spec.
Example:
| |
| |
| |
| |
| |
| |
|... |
| |
| |
| |
|... |
| |
Here, a schema for XHTML is imported together with phone.xsd (which is assumed to contain a description of phone numbers) and its description of phone is redefined.
Namespaces
When defining a new XML-based language, we usually want to assign it a unique namespace.
XML Schema
• uses namespaces itself - to distinguish schema instructions from the language we are describing
• supports namespace assigning - by associating a target namespace to the language we are describing
Example:
| |
| |
| |
|... |
| |
| |
| |
| |
|... |
| |
| |
| |
|... |
| |
• the default namespace is that of XML Schema (such that e.g. complexType is considered an XML Schema element)
• the target namespace is our business card namespace
• the b prefix also denotes our business card namespace (such that we can refer to target language constructs from within the schema)
Unfortunately, XML Schema has a rather unconventional use of namespaces:
• prefixes in attribute values (e.g. ref="b:name") - the namespace spec does not tell how to resolve this
• a notion of "unqualified locals" (which is even a default) - allowing prefixes to be omitted from locally declared elements in instance documents
This precludes the use of standard namespace-compliant XML parsers for reading XML Schema documents :-(
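The point about prefixes in attribute values can be seen with any namespace-aware parser. In the Python sketch below, ElementTree expands element names to {namespace}local form on its own, but an attribute value such as "b:card_type" is just a string to it; the application has to track the prefix declarations in scope and resolve the prefix itself. The sample schema fragment, its namespace URI http://example.org/card, and the helper name resolve_prefixed_values are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; the target namespace URI is a placeholder.
SAMPLE = """<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:b="http://example.org/card"
        targetNamespace="http://example.org/card">
  <element name="card" type="b:card_type"/>
</schema>"""


def resolve_prefixed_values(xml_text, attr_name):
    """Resolve prefix:local attribute values against the declarations in scope.

    Returns a list of (raw value, namespace URI or None, local name).
    """
    in_scope = []      # stack of (prefix, uri) pairs currently in scope
    resolved = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    events = ("start-ns", "end-ns", "start")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            in_scope.append(payload)          # payload is a (prefix, uri) pair
        elif event == "end-ns":
            in_scope.pop()
        else:
            value = payload.get(attr_name)    # payload is the opened element
            if value is not None and ":" in value:
                prefix, local = value.split(":", 1)
                # Innermost declaration of the prefix wins.
                uri = next((u for p, u in reversed(in_scope) if p == prefix), None)
                resolved.append((value, uri, local))
    return resolved
```

The parser handles the element name schema without help, whereas the type attribute only becomes a qualified name through the extra bookkeeping in resolve_prefixed_values.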
Attribute and element defaults
Side-effect of validation: insertion of default values
Each attribute and element declaration can contain a default="..." attribute.
• attribute defaults: are inserted (before validation) if the attribute is absent (in elements of the type containing the declaration)
• element defaults: are inserted as character data in empty elements (of the type of the declaration)
For some strange design reason, element defaults cannot contain markup.
Example:
With a schema containing:
| |
a schema processor will validate and transform:
| |
into:
|no content explicitly provided |
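The side-effect described in this section can be imitated in a few lines of Python. This is a sketch only: apply_defaults and the two dictionaries are hypothetical, defaults are keyed by element name instead of by type, and no validation is performed.

```python
import xml.etree.ElementTree as ET


def apply_defaults(root, attribute_defaults, element_defaults):
    """Insert default attribute values and default element contents in place.

    attribute_defaults: element name -> {attribute name: default value}
    element_defaults:   element name -> default character data
    """
    for elem in root.iter():
        # Attribute defaults are inserted only where the attribute is absent.
        for attr, value in attribute_defaults.get(elem.tag, {}).items():
            elem.attrib.setdefault(attr, value)
        # Element defaults apply only to elements that are completely empty.
        if elem.tag in element_defaults and len(elem) == 0 and not elem.text:
            elem.text = element_defaults[elem.tag]
    return root
```

For instance, with attribute_defaults={"card": {"lang": "en"}} and element_defaults={"title": "n/a"}, the document <card><title/></card> becomes <card lang="en"><title>n/a</title></card>, while explicitly provided values are left untouched.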
Identity constraints
XPath can be used to specify uniqueness requirements.
Example:
| |
occurring in an element declaration, means that: within each personlist, every ssn attribute of a person element must have a unique value.
Similarly, we can define keys (with key) and references (with keyref), which generalize the ID/IDREF mechanism from DTD in a straightforward way.
Only a simple subset of XPath is allowed:
• only the child axis and the attribute axis
• only node set expressions
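The uniqueness requirement from the example (within each personlist, every ssn attribute of a person element must be unique) can be checked with a short Python sketch. ElementTree's limited path syntax plays the role of the restricted XPath subset here; the function name check_unique and the wrapper element registry in the usage example are assumptions.

```python
import xml.etree.ElementTree as ET


def check_unique(scope, selector, field_attr):
    """Return the duplicated values of `field_attr` among the nodes
    selected by `selector` (relative to the element `scope`)."""
    seen, duplicates = set(), []
    for node in scope.findall(selector):
        value = node.get(field_attr)
        if value is None:
            continue          # a uniqueness constraint ignores nodes lacking the field
        if value in seen:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def check_document(root):
    """Apply the constraint once per personlist: the scope is the
    element whose declaration carries the constraint."""
    return [dup for plist in root.iter("personlist")
            for dup in check_unique(plist, "person", "ssn")]
```

Two person elements may share an ssn as long as they sit in different personlist elements, because the constraint is evaluated separately for each scope element.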
A larger example
An XML Schema description of our recipe collections, recipes.xsd:
| |
Note that:
• we need to set elementFormDefault="qualified" to use the standard Namespace semantics
• the nonNegativeDecimal and anycontent definitions were not possible with DTD
• we choose to use a mix of global and local definitions
• as with the DTD version, we still cannot express that:
o unit should only be allowed when amount is present
o the comment element should be allowed to appear anywhere
o nested ingredient elements should only be allowed when amount is absent
By inserting the following:
| |
|... |
| |
into our recipe collection recipes.xml, we state that the document is intended to be valid according to recipes.xsd.
Problems with XML Schema
The general problem:
• it is generally too complicated (the spec is several hundred pages in a very technical language), so it is hard to use by non-experts - but many non-experts need schemas to describe intermediate data formats
also, the complicated design necessitates an incomprehensible specification style (example from Part 1, Section 3.3.1: "{value constraint} establishes a default or fixed value for an element. If default is specified, and if the element being ·validated· is empty, then the canonical form of the supplied constraint value becomes the [schema normalized value] of the ·validated· element in the post-schema-validation infoset. If fixed is specified, then the element's content must either be empty, in which case fixed behaves as default, or its value must match the supplied constraint value.", or from Section 3.3.4: "If the item cannot be ·strictly assessed·, because neither clause 1.1 nor clause 1.2 above are satisfied, [Definition:] an element information item's schema validity may be laxly assessed if its ·context-determined declaration· is not skip by ·validating· with respect to the ·ur-type definition· as per Element Locally Valid (Type) (§3.3.4).")
Practical limitations of expressibility:
• cannot require specific root element (so extra information is required to validate even the simplest documents)
• when describing mixed content, the character data cannot be constrained in any way (not even a set of valid characters can be specified)
• content and attribute declarations cannot depend on attributes or element context (this was also listed as a central problem of DTD)
o a typical example that cannot be expressed (actually from the XML Schema spec which is packed with examples): "'default' and 'fixed' may not both be present, and [...] if 'ref' is present, then all of , 'form' and 'type' must be absent"
o a solution to this would also eliminate the need for "nil values"
• it is not 100% self-describing (as a trivial example, see the previous point), even though that was an initial design requirement
• defaults cannot be specified separate from the declarations (this makes it hard to make families of schemas that only differ in the default values)
• element defaults can only be character data (not containing markup)
Technical problems:
• although it technically is namespace conformant, it does not seem to follow the namespace spirit (because of prefixes in attribute values + "unqualified locals")
The major source of complexity:
• the notion of "type" adds an extra layer of confusing complexity:
o in instance documents, we have "elements" which have "element names"
o in schemas, elements are described by "element definitions" which associate "element names" with "type names",
o type definitions associate "type names" with "element descriptions" which describe the elements in the instance documents
(and to cause further confusion, the XML 1.0 spec uses the term "element type" for the name of an element)
• xsi:type attributes are required in instance documents when derived types are being used in place of base types (then one might as well have defined a new element and used a substitution group)
• substitution groups and local declarations (with non-unique names) make it difficult to look up the description of a given element
Non-minimalistic design:
• substitution groups and type derivation seem to be different attempts to solve the same problems
• incorporation of XPath to express uniqueness and keys (neither uniqueness nor keys are fundamental concepts for schemas, so dragging in a language as big as XPath is overkill)
• the set of built-in data types is not minimalistic (a minimalistic set + some data type libraries would lower the learning burden)
• the use of Perl-style regular expressions violates the principle of using XML syntax to describe XML syntax
For other comments about the design of XML Schema, see for instance pub/a/2000/07/05/specs/lastword.html, xql/tally.html , and pub/a/2002/07/31/wxstypes.html.
Document Structure Description 2.0
- a successor to DSD 1.0, a schema language developed in cooperation by BRICS and AT&T Labs Research.
DSD is designed to:
• contain few and simple language constructs (based on familiar concepts, such as boolean logic and regular expressions)
• be easy to understand, also by non-XML-experts
• have more expressive power than other schema languages for most practical purposes
The central ideas in DSD 2.0:
• a schema consists of a list of rules
• for every element in the instance document, all rules are processed
• rules can conditionally depend on the name, attributes, and context of the current element
• rules contain declare and require sections
• declare sections specify which content (sub-elements and character data) and attributes are allowed for the current element
• require sections specify extra restrictions on content and attributes
• character data and attribute values are described by regular expressions
Main benefits, compared to XML Schema:
• the complete language specification is only 15 pages (excluding examples)
• no notion of type, rules are directly tied to element names (and no subtyping, substitution groups, or local definitions)
• rules can be hierarchical by depending on attribute values and element context
• DSD is 100% self-describing (so there is a complete "DSD for DSDs")
• lots of non-essential features are removed or reduced to more basic and general constructs
DSD 1.0 was announced in November 1999. A draft spec for DSD 2.0 is now available!
Example
A DSD 2.0 description of our recipe collections:
| |
Notice in particular:
• the hierarchical rule in the description of ingredient
• the modular definitions of two stringtypes and a rule
• the simple use of namespaces
• it is intuitive and human-readable (if you are used to looking at XML documents :-)
This DSD is more precise than the DTD and the XML Schema descriptions:
• unit is only allowed when amount is present
• the comment element is allowed to appear anywhere
• nested ingredient elements are only allowed when amount is absent
One can check that this is indeed a DSD by validating it with the meta-DSD.
Rules
- a closer look at the central DSD 2.0 construct
Example:
| |
Rules can be:
• if rules, conditional rules guarded by boolean expressions over element properties
• declare rules, declaring which attributes and contents an element may have
• require rules, containing boolean expressions over element properties that are required to hold
(In addition, there are unique and pointer for generalized IDs/IDREFs, normalize for whitespace and case normalization, and default for default attributes and contents.)
Rules can be defined (given an ID for reference) to support modularity, as e.g. ANYCONTENT in the full example.
Boolean expressions
Boolean logic for expressing properties of elements:
• element names and contents
• attribute presence and values
• context properties: parent, ancestor, child, and descendant
combined with and, or, not, impl, etc.
Example:
| |
means: "either the current element has an a attribute with a b value and also a c ancestor element with a d attribute, or - if only looking at e and f elements - its contents consist of one e element followed by one f element (where the elements use the namespace selected by the p prefix)."
Boolean expressions are used both as conditions in conditional constraints and as requirements (in require).
As with the other syntactic categories, boolean expressions can be defined for modularity.
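The flavour of such boolean expressions over element properties can be conveyed with a few Python combinators. This is not DSD syntax and not the DSD processor's algorithm; it is a hedged sketch in which every function name is hypothetical, namespaces are ignored, and only the attribute, element-name, and ancestor properties are modelled.

```python
import xml.etree.ElementTree as ET

# A condition is a function (element, parents) -> bool, where `parents`
# maps each element to its parent (ElementTree keeps no parent pointers).


def attribute(name, value=None):
    """True if the element has the attribute (with the given value, if any)."""
    return lambda elem, parents: (
        name in elem.attrib and (value is None or elem.attrib[name] == value))


def element(name):
    """True if the element has the given name."""
    return lambda elem, parents: elem.tag == name


def ancestor(cond):
    """True if some proper ancestor of the element satisfies `cond`."""
    def test(elem, parents):
        node = parents.get(elem)
        while node is not None:
            if cond(node, parents):
                return True
            node = parents.get(node)
        return False
    return test


def and_(*conds):
    return lambda elem, parents: all(c(elem, parents) for c in conds)


def or_(*conds):
    return lambda elem, parents: any(c(elem, parents) for c in conds)


def not_(cond):
    return lambda elem, parents: not cond(elem, parents)


def impl(premise, conclusion):
    """Logical implication: premise -> conclusion."""
    return or_(not_(premise), conclusion)


def parent_map(root):
    return {child: parent for parent in root.iter() for child in parent}
```

The first half of the example above, "an a attribute with a b value and also a c ancestor element with a d attribute", then reads and_(attribute("a", "b"), ancestor(and_(element("c"), attribute("d")))).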
Regular expressions
Both attributes and character data are described by regular expressions over the Unicode alphabet.
Regular expressions can be built from:
• constant strings, character sets
• sequencing, union, repetition
• complement, intersection (!)
• boolean expressions (which describe elements)
• ...
As with constraints, regular expressions can be defined for modularity.
Example:
|... |
| |
| |
| |
| |
|... |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
Libraries of common expressions can be made with the import feature described later...
If more than one regular expression is declared for the contents of an element, they are implicitly merged in an unordered
Inclusion and extension
To enhance reusability, maintainability, and readability, DSD descriptions can consist of several XML documents.
DSD 2.0 simply relies on XInclude for composing DSD fragments into complete specifications. (However, full XPointer is not used - only simple URLs that denote whole documents.)
This, combined with the notion of conditional rules, makes it easy to write modular specifications, reuse and extend existing schemas, and create families of related schemas.
Links to more information
TR/xmlschema-0
XML Schema Part 0: Primer (a non-normative introduction)
TR/xmlschema-1
XML Schema Part 1: Structures
TR/xmlschema-2
XML Schema Part 2: Datatypes
brics.dk/DSD
the DSD homepage
cover/schemas.html
Robin Cover's XML schema information
pub/1999/12/dtd
article on schema languages
pub/a/2000/11/29/schemas/part1.html
introduction to XML Schema
BestPracticesHomepage.html
"best practices" of XML Schema
xml/books/bible2/chapters/ch24.html
chapter from "XML Bible" on XML Schema
read.php?item=1097
"W3C XML Schema still has big problems", article on
cobase.cs.ucla.edu/tech-docs/dongwon/ucla-200008.html
"Comparative Analysis of Six XML Schema Languages"
schemavalid/faq/xml-schema.html
XML Schema FAQ
xml.
Apache's Xerces parser and validator
XLink, XPointer, and XPath
- linking and addressing
Overview:
• XLink, XPointer, and XPath - three layers of languages
XLink:
• Problems with HTML links - why do we need something new?
• The XLink linking model - a generalization of HTML links
• An example - a link between two remote resources
• Linking elements - defining links
• Behavior - show and actuate
• Simple vs. Extended links - compatibility issues
• HLink vs. XLink - a new alternative?
XPointer, Part I - using XPointer in XLink:
• XPointer: Why, what, and how? - introduction
• XPointer vs. XPath - what is the difference
• XPointer fragment identifiers - the structure of an XPointer
XPath:
• Location paths - the central construct
• Location steps - expressing node-sets
o Axes - selecting candidates
o Node tests - initial filtration
o Predicates - fine-grained filtration
• Expressions - a little expression language
• Core function library - the built-in functions
• Abbreviations - convenient notation
• XPath visualization - a useful tool
• XPath examples - continuing the recipe example
• XPath 2.0 - the next version
XPointer, Part II - how XPointer uses XPath:
• Context initialization - filling out the gap between XPath and XLink
• Extra XPointer features - generalizing XPath
Selected links:
• Tools
• Links to more information
XLink, XPointer, and XPath
- imagine a Web without links...
Three layers:
• XLink
o a generalization of the HTML link concept
o higher abstraction level (intended for general XML - not just hypertext)
o more expressive power (multiple destinations, special behaviors, linkbases, ...)
o uses XPointer to locate resources
• XPointer
o an extension of XPath suited for linking
o specifies connection between XPath expressions and URIs
• XPath
o a declarative language for locating nodes and fragments in XML trees
o used in both XPointer (for addressing), XSL (for pattern matching), XML Schema (for uniqueness and scope descriptions), and XQuery (for selection and iteration)
These technologies are standardized but not all widely implemented yet.
XQuery vs. XPointer/XPath? Reminiscent, but different goals:
• XQuery: SQL-like database queries
• XPointer/XPath: robust addressing into known information
Problems with HTML links
The HTML link model:
[pic]
Construction of a hyperlink:
• <a name="..."> is placed at the destination
• <a href="..."> is placed at the source
Problems when using the HTML model for general XML:
• Link recognition:
o in HTML, links are recognized by element names (a, img, ..)
- we want a generic XML solution
o the "semantics" of a link is defined in the HTML specification
- we want to identify abstract semantic features, e.g. link actuation
• Limitations:
o an anchor must be placed at every link destination (problem with read-only documents)
- we want to express relative locations (XPointer!)
o the link definition must be at the same location as the link source (outbound)
- we want inbound and third-party links
o only individual nodes can be linked to
- we want links to whole tree fragments
o a link always has one source and one destination
- we want links with multiple sources and destinations
The usual point: generic solutions allow generic tools!
The XLink linking model
Basic XLink terminology:
Link: explicit relationship between two or more resources.
Linking element: an XML element that asserts the existence and describes the characteristics of a link.
Locator: an identification of a remote resource that is participating in the link.
[pic]
One linking element defines a set of traversable arcs between some resources.
A local resource comes from the linking element's own content.
Outbound: the source is a local resource
Inbound: the destination is a local resource
Third-party: none of the resources are local
Third-party links can be used to construct shared link bases for browsers.
An example
A linking element defining a third-party "extended" link involving two remote resources:
| |
• the namespace http://www.w3.org/1999/xlink is used to recognize XLink information in general XML documents
o the namespace often (but not necessarily) uses namespace prefix xlink
o host language: elements and attributes not belonging to this namespace are ignored by XLink processors
o all XLink information is defined in attributes (in host language elements)
• xlink:type="extended" indicates a linking element
• xlink:type="locator" locates a remote resource
• xlink:type="arc" defines traversal rules
A powerful example application of general XLinks:
Using third-party links and a smart browser, a group of people can annotate Web pages with "post-it notes" for discussion - without having write access to the pages. They simply need to agree on a set of URIs to XLink link bases defining the annotations. The smart XLink-aware browser lets them select parts of the Web pages (as XPointer ranges), comment on the parts by creating XLinks to small XHTML documents, view each other's comments, place comments on comments, and perhaps also aid in structuring the comments.
Linking elements
- defining links
All elements with XLink information contain an xlink:type attribute.
• a general linking element is defined using an xlink:type="extended" attribute; this element can contain the following:
• a local resource is defined with xlink:type="resource"
• a remote resource is defined with xlink:type="locator" and with an xlink:href attribute (an XPointer expression locating the resource)
• arcs (traversal rules) are defined with xlink:type="arc":
o both "resource" and "locator" elements can have xlink:label attributes
o an arc element has an xlink:from and an xlink:to attribute
o the "arc" element defines a set of arcs: from each resource having the from label to each resource having the to label
(Note the confusing terminology: a resource is defined either by a "resource" element or by a "locator" element.)
XPointer is described later - just think of XPointer expressions as URIs for now...
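How one "arc" element expands into a set of arcs can be sketched in Python with ElementTree. The sample link, its element names, and the function expand_arcs are hypothetical; local resources are represented only by a "local:" marker, and the rule that a missing from or to attribute stands for all labels is not implemented.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical extended link: two locators labelled "a", one labelled "b".
SAMPLE = """<mylink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
  <loc xlink:type="locator" xlink:href="x.xml" xlink:label="a"/>
  <loc xlink:type="locator" xlink:href="y.xml" xlink:label="a"/>
  <loc xlink:type="locator" xlink:href="z.xml" xlink:label="b"/>
  <go xlink:type="arc" xlink:from="a" xlink:to="b"/>
</mylink>"""


def xl(name):
    """Qualified attribute name in the XLink namespace, as ElementTree sees it."""
    return "{" + XLINK + "}" + name


def expand_arcs(link):
    """Return (source, destination) pairs for all arcs of an extended link."""
    by_label = {}
    for child in link:
        kind = child.get(xl("type"))
        if kind == "locator":
            participant = child.get(xl("href"))      # remote resource
        elif kind == "resource":
            participant = "local:" + child.tag       # local resource
        else:
            continue
        by_label.setdefault(child.get(xl("label")), []).append(participant)
    arcs = []
    for child in link:
        if child.get(xl("type")) == "arc":
            # One arc from each "from" resource to each "to" resource.
            for source in by_label.get(child.get(xl("from")), []):
                for target in by_label.get(child.get(xl("to")), []):
                    arcs.append((source, target))
    return arcs
```

Since two resources carry the label a, the single arc element in SAMPLE yields two traversable arcs, both ending at z.xml.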
Behavior
- link semantics
Arcs can be annotated with abstract behavior information using the following attributes:
xlink:show - what happens when the link is activated?
Possible values:
embed
insert the presentation of the target resource (the one at the end of the arc) in place of the source resource (the one at the beginning of the arc, where traversal was initiated) (example: as images in HTML)
new
display the target resource some other place without affecting the presentation of the source resource (example: as target="_blank" in an HTML link)
replace
replace the presentation of the resource containing the source with a presentation of the destination (example: as normal HTML links)
other
behavior specified elsewhere
none
no behavior is specified
xlink:actuate - when is the link activated?
Possible values:
onLoad
traverse the link immediately when recognized (example: as HTML images)
onRequest
traverse when explicitly requested (example: as normal HTML links)
other
behavior specified elsewhere
none
no behavior is specified
Note: these notions of link behavior are rather abstract and do not make sense for all applications.
Semantic attributes: describe the meaning of link resources and arcs
xlink:title
provide human readable descriptions (also available as xlink:type="title" to allow markup)
xlink:role and xlink:arcrole
URI references to descriptions
Simple vs. Extended links
- for compatibility and simplicity
Two kinds of links:
• extended - the general ones we have seen so far
• simple - a restricted version of extended links: only for two-ended outbound links (enough for HTML-style links)
Convenient shorthand notation for simple links:
| |
is equivalent to:
| |
Many XLink properties (e.g. xlink:type and xlink:show) can conveniently be specified as defaults in the schema definition!
When should I use XLink? Tim Berners-Lee: only for hypertext linking (not everybody agrees...)
HLink vs. XLink
Why is XHTML not using XLink?
The problem:
• we want a general mechanism for identifying links
• ...but we want full control when designing the syntax of the host languages
When integrating XLink in a host language, the use of the XLink namespace makes a mess.
HLink: a recent alternative to XLink
• same underlying ideas
• different syntax
Example HLink: Definition of the link semantics of elements in XHTML.
| |
13 September 2002: W3C's HTML Working Group publishes HLink draft for intended use in XHTML 2.0
24 September 2002: W3C's Technical Architecture Group rejects HLink in favor of XLink for the design of XHTML 2.0
XPointer: Why, what, and how?
• an extension of XPath which is used by XLink to locate remote link resources
• relative addressing: allows links to places with no anchors
• flexible and robust: XPointer/XPath expressions often survive changes in the target document
• can point to substrings in character data and to whole tree fragments
Example of an XPointer:
| URI |
|----------------------------------------------------------------- |
|/ \ |
|(article/section[position() ................