What is XML?
XML is the Extensible Markup Language. It is designed to improve the functionality of the Web by providing more flexible and adaptable information identification.
It is called extensible because it is not a fixed format like HTML (a single, predefined markup language). Instead, XML is actually a `meta-language' (a language for describing other languages) that lets you design your own customized markup languages for limitless different types of documents. XML can do this because it's written in SGML, the international standard meta-language for text markup systems (ISO 8879).
What is XML for?
XML is intended `to make it easy and straightforward to use SGML on the Web: easy to define document types, easy to author and manage SGML-defined documents, and easy to transmit and share them across the Web.'
It defines `an extremely simple dialect of SGML which is completely described in the XML Specification. The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.'
`For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML'. XML is not just for Web pages: it can be used to store any kind of structured information, and to enclose or encapsulate information in order to pass it between different computing systems which would otherwise be unable to communicate.
What is SGML?
SGML is the Standard Generalized Markup Language (ISO 8879:1985), the international standard for defining descriptions of the structure of different types of electronic document. SGML is very large, powerful, and complex. It has been in heavy industrial and commercial use for over a decade, and there is a significant body of expertise and software to go with it. XML is a lightweight cut-down version of SGML that keeps enough of its functionality to make it useful but removes all the optional features that make SGML too complex to program for in a Web environment.
What is HTML?
HTML is the HyperText Markup Language (RFC 1866), a small application of SGML used on the Web. It defines a very simple class of report-style documents, with section headings, paragraphs, lists, tables, and illustrations, with a few informational and presentational items, and some hypertext and multimedia. There is also an XML version of HTML.
Aren't XML, SGML, and HTML all the same thing?
Not quite; SGML is the mother tongue, and has been used for describing thousands of different document types in many fields of human activity, from transcriptions of ancient Irish manuscripts to the technical documentation for stealth bombers, and from patients' clinical records to musical notation. SGML is very large and complex, however, and probably overkill for most common applications.
XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them. It omits all the options, and most of the more complex and less-used parts of SGML in return for the benefits of being easier to write applications for, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files may still be processed in the same way as any other SGML file (see the question on XML software).
HTML is just one of the SGML or XML applications, the one most frequently used on the Web.
Technical readers may find it more useful to think of XML as being SGML-- rather than HTML++.
Why is XML such an important development?
It removes two constraints which were holding back Web developments:
1. Dependence on a single, inflexible document type (HTML) which was being much abused for tasks it was never designed for;
2. The complexity of full SGML, whose syntax allows many powerful but hard-to-program options.
XML allows the flexible development of user-defined document types. It provides a robust, non-proprietary, persistent, and verifiable file format for the storage and transmission of text and data both on and off the Web; and it removes the more complex options of SGML, making it easier to program for.
What does an XML document look like inside?
The basic structure is very similar to most other applications of SGML, including HTML. XML documents can be very simple, with no document type declaration (DTD), and straightforward nested markup of your own design:
Stop the planet, I want to get off!
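As a sketch of such a simple document, the sentence above could be marked up as follows. The element names conversation, greeting, and response, and the greeting text, are assumptions for illustration only:

```xml
<?xml version="1.0" standalone="yes"?>
<!-- no DTD: the nesting alone defines the structure (element names are hypothetical) -->
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>
```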
Or they can be more complicated, with a DTD specified, and maybe an internal subset (local DTD changes in [square brackets]), and a more complex structure:
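A hedged sketch of the more complicated kind follows. The DTD URL, the parameter entity, and every element and attribute name are hypothetical, chosen only to show where the DTD reference and the internal subset go:

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- external DTD named by a (hypothetical) URL; internal subset in square brackets -->
<!DOCTYPE titlepage SYSTEM "http://www.example.org/dtds/typo.dtd"
  [<!ENTITY % active.links "INCLUDE">]>
<titlepage>
  <white-space type="vertical" amount="36"/>
  <title font="Baskerville" size="24/30" alignment="centered">Hello, world!</title>
  <white-space type="vertical" amount="12"/>
  <author font="Baskerville" size="18/22" style="italic">Vitam capias</author>
</titlepage>
```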
Or they can be anywhere between: a lot will depend on how you want to define your document type (or whose you use) and what it will be used for.
How does XML handle white space in my documents?
The SGML rules regarding white space have been changed for XML. All white space, including line breaks, TAB characters, and regular spaces, even between those elements where no text can ever appear, is passed by the parser unchanged to the application (browser, formatter, viewer, converter, etc.), identifying the context in which the white space was found (element content, data content, or mixed content). This means it is the application's responsibility to decide what to do with such space, not the parser's:
· Insignificant white space between structural elements (space which occurs where only element content is allowed, i.e. between other elements, where text data never occurs) will get passed to the application. In SGML this white space gets suppressed, which is why you can put all that extra space in HTML documents and not worry about it; this is not so in XML;
· Significant white space (space which occurs within elements which can contain text and markup mixed together, usually mixed content or PCDATA) will still get passed to the application exactly as under SGML. It is the application's responsibility to handle it correctly.
My title for Section 1.
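A sketch of pretty-printed markup of the kind being discussed, using the title text above. The element names chapter, title, and para, and the paragraph text, are assumptions for illustration:

```xml
<chapter>
    <title>
        My title for
        Section 1.
    </title>
    <para>
        Some text.
    </para>
</chapter>
```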
The parser must inform the application that white space has occurred in element content, if it can detect it. (Users of SGML will recognize that this information is not in the ESIS, but it is in the Grove.) In the above example, the application will receive all the pretty-printing line breaks, TABs, and spaces between the elements as well as those embedded in the chapter title. It is the function of the application, not the parser, to decide which type of white space to discard and which to retain.
What's a Document Type Definition (DTD) and where do I get one?
A DTD is a formal description in XML Declaration Syntax of a particular type of document. It sets out what names are to be used for the different types of element, where they may occur, and how they all fit together. For example, if you want a document type to be able to describe Lists that contain Items, the relevant part of your DTD might contain something like this:
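A minimal sketch of such declarations, taking the element type names List and Item from the text; the exact form is an assumption:

```xml
<!-- a List holds one or more Items; an Item holds plain text -->
<!ELEMENT List (Item)+>
<!ELEMENT Item (#PCDATA)>
```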
This defines a list as an element type containing one or more items (that's the plus sign); and it defines items as element types containing just plain text (Parsed Character Data or PCDATA). Validating parsers read the DTD before they read your document so that they can identify where every element type ought to come and how each relates to the other, so that applications which need to know this in advance (most editors, search engines, navigators, databases) can set themselves up correctly. The example above lets you create lists like:
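For instance, a list conforming to declarations for List and Item element types might look like this (the item text is purely illustrative):

```xml
<List>
  <Item>Chocolate</Item>
  <Item>Music</Item>
  <Item>Surfing</Item>
</List>
```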
How the list appears in print or on the screen depends on your stylesheet: you do not normally put anything in the XML to control formatting like you had to do with HTML before stylesheets. This way you can change style easily without ever having to edit the document itself.
A DTD provides applications with advance notice of what names and structures can be used in a particular document type. Using a DTD when editing files means you can be certain that all documents which belong to a particular type will be constructed and named in a consistent and conformant manner. DTDs are less important for processing documents already known to be well formed, but they are still needed if you want to take advantage of XML's special attribute types like the built-in ID/IDREF cross-reference mechanism.
There are thousands of DTDs already in existence in all kinds of areas (see the SGML/XML Web pages for pointers). Many of them can be downloaded and used freely; or you can write your own (see the question on creating your own DTD). Existing SGML DTDs need to be converted to XML for use with XML systems: read the question on converting SGML DTDs to XML, and expect to see announcements of popular DTDs becoming available in XML format.
How do I create my own DTD?
You need to use the XML Declaration Syntax (very simple: declaration keywords begin with <! rather than just the open angle bracket, and the way the declarations are formed also differs slightly). Here's an example of a DTD for a shopping list, based on the fragment used in an earlier question:
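A sketch of such a DTD, written to match the description given in this answer (the element type names Shopping-List and Item come from the text):

```xml
<!-- one or more Item elements inside Shopping-List; each Item is plain text -->
<!ELEMENT Shopping-List (Item)+>
<!ELEMENT Item (#PCDATA)>
```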
It says that there shall be an element called Shopping-List and that it shall contain elements called Item: there must be at least one (that's the plus sign) but there may be more than one. It also says that the Item element may contain parsed character data (PCDATA, i.e. text).
Because there is no other element which contains Shopping-List, that element is assumed to be the `root' element, which encloses everything else in the document. You can now use it to create an XML file: give your editor the declarations:
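A sketch of those declarations, assuming the shopping-list DTD was saved under the hypothetical file name shoplist.dtd:

```xml
<?xml version="1.0"?>
<!-- shoplist.dtd is an assumed file name for the DTD -->
<!DOCTYPE Shopping-List SYSTEM "shoplist.dtd">
```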
(assuming you put the DTD in that file). Now your editor will let you create files according to the pattern:
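A document instance following that pattern might look like this (the item text is illustrative only):

```xml
<Shopping-List>
  <Item>Chocolate</Item>
  <Item>Sugar</Item>
  <Item>Butter</Item>
</Shopping-List>
```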
It is possible to develop complex and powerful DTDs of great subtlety, but for any significant use you should learn more about document systems analysis and document type design. See for example Developing SGML DTDs by Maler and el Andaloussi, Prentice Hall, 1997, ISBN 0-13-309881-8, which was written for SGML, but perhaps 95% of it applies to XML as well, as XML is much simpler than full SGML (see the list of restrictions, which shows what has been cut out).
Can a root element type be explicitly declared in the DTD?
Bob DuCharme writes: No. This is done in the document's Document Type Declaration, not in the DTD. In a Document Type Declaration like this:
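A sketch of such a declaration, assuming a local copy of the DocBook XML DTD under the file name docbookx.dtd (the system identifier is an assumption):

```xml
<!-- "chapter" names the root element; the system identifier locates the DTD -->
<!DOCTYPE chapter SYSTEM "docbookx.dtd">
```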
The whole point of the `chapter' part is to identify which of the element types declared in the specified DTD should be used as the root element (also known as the `document element'--the element to be used to enclose the whole document). I believe the highest level element in DocBook is `set', but I find it hard to imagine someone creating a document to represent a set of books. We are free to use set, book, chapter, article, or even para as the document element for a valid DocBook document.
[One job some parsers do is determine which element type[s] in a DTD are not contained in the content model of any other element type: these are by deduction the prime candidates for being default root elements. (PF)]
This is A Good Thing, because it adds flexibility to how the DTD is used. It's the reason that XML (and SGML) have lent themselves so well to electronic publishing systems in which different elements were mixed and matched to create different documents all conforming to the same DTD.
I've seen schema proposals that let you specify which of a schema's element types could be a document's root element, but after a quick look at section 3.3 of Part 1 of the W3C Schema Recommendation and the RELAX NG schema for RELAX, I don't believe that either of these let you do this. I could be wrong.
I keep hearing about alternatives to DTDs. What's a Schema?
A DTD is for specifying the structure (only) of an XML file: it gives the names of the elements, attributes, and entities that can be used, and how they fit together. DTDs are designed for use with traditional text documents, not rectangular or tabular data, so the concept of data types does not exist: text is just text. If you need to specify numeric ranges or to define limitations or checks on the text content, a DTD is the wrong tool.
The W3C XML Schema recommendation provides a means of specifying element content in terms of data types, so that document type designers can provide criteria for validating the content of elements as well as the markup itself. Schemas are written as XML files, avoiding the need for processing software to be able to read XML Declaration Syntax, which is different from XML Instance Syntax.
Schemas are a formal W3C Recommendation, and a number of sites are serving useful applications as both DTDs and Schemas, e.g. http://www.schema.net and http://www.dtd.com. The term `vocabulary' is sometimes used to refer to `DTDs and Schemas' together. Designers should note that Schemas are aimed at database-style applications where element data content requires validation: they are inappropriate for traditional text publishing applications.
Authors and publishers should note that the plural of Schema is Schemas: the use of the singular to do duty for the plural is a foible dear to the semi-literate; the use of the old (Greek) plural schemata is now unnecessary didacticism. Writers should also note that the plural of DTD is DTDs: there is no apostrophe.
Bob DuCharme adds: Many XML developers were dissatisfied with the syntax of the markup declarations described in the XML spec for two reasons. First, they felt that if XML documents were so good at describing structured information, then the description of a document type's own structure (its schema) should be in an XML document instead of written with its own special syntax. In addition to being more consistent, this would make it easier to edit and manipulate the schema with regular document manipulation tools. Secondly, they felt that traditional DTD notation didn't allow document type designers the power to impose enough constraints on the data--for example, the ability to say that a certain element type must always have a positive integer value, that it may not be empty, or that it must be one of a list of possible choices. This eases the development of software using that data because the developer has less error-checking code to write.
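As a hedged illustration of the kind of constraints described, here is a small W3C XML Schema fragment. The element names quantity and size, and the permitted values, are hypothetical:

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- must always hold a positive integer, so it cannot be empty -->
  <xs:element name="quantity" type="xs:positiveInteger"/>
  <!-- must be one of a list of possible choices -->
  <xs:element name="size">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="small"/>
        <xs:enumeration value="medium"/>
        <xs:enumeration value="large"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
```

Note that the schema is itself an XML file, which is the first of the two points made above.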
How will XML affect my document links?
The linking abilities of XML systems are much more powerful than those of HTML, so you'll be able to do much more with them. Existing HREF-style links will remain usable, but the new linking technology is based on the lessons learned in the development of other standards involving hypertext, such as TEI and HyTime, which let you manage bidirectional and multi-way links, as well as links to a span of text (within your own or other documents) rather than to a single point. These features have been available to SGML users for many years, so there is considerable experience and expertise available in using them.
The XML Linking Specification (XLink) and XML Extended Pointer Specification (XPointer) documents contain the details. An XML link can be either a URL or a TEI-style Extended Pointer (XPointer), or both. A URL on its own is assumed to be a resource; if an XPointer or XLink follows it, it is assumed to be a sub-resource of that URL; an XPointer on its own is assumed to apply to the current document (all exactly as with HTML).
An XLink is always preceded by one of #, ?, or |. The # and ? mean the same as in HTML applications; the | means the sub-resource can be found by applying the link to the resource, but the method of doing this is left to the application. An XPointer can only follow a #.
The TEI Extended Pointer Notation (EPN) is much more powerful than the fragment address on the end of some URLs, as it allows you to specify the location of a link end using the structure of the document in addition to known, fixed points like IDs. For example, the linked second occurrence of the word `XPointer' two paragraphs back could be referred to as http://www.ucc.ie/xml/faq.sgml#ID(hypertext).child(2,*).child(2,#element,'p').child(3,#element,'link'), meaning the third link element within the second paragraph within the second object in the element whose ID is hypertext (this question). Count the objects from the start of this question in the XML source (which has the ID hypertext):
1. the first child object is the title of the question;
2. the second child object is the one that follows the title, selected by child(2,*);
3. within that object, go to the second paragraph, selected by child(2,#element,'p');
4. count to the third link.
David Megginson has produced an xpointer function for Emacs/psgml, which will deduce an XPointer for any location in an XML document.
How does XML handle metadata?
Because XML lets you define your own markup language, you can make full use of the extended hypertext features (see the question on Links) of XML to store or link to metadata in any format (e.g. ISO 11179, Dublin Core, Warwick Framework, Resource Description Framework (RDF), and Platform for Internet Content Selection (PICS)).
There are no predefined elements in XML, because it is an architecture, not an application, so it is not part of XML's job to specify how or if authors should or should not implement metadata. You are therefore free to use any suitable method from simple attributes to the embedding of entire Dublin Core/Warwick Framework metadata records. Browser makers may also have their own architectural recommendations or methods to propose.
How do I control appearance?
In HTML, default styling was built into the browsers because the tagset of HTML was predefined and hardwired into browsers. In XML, where you can define your own tagset, browsers cannot possibly be expected to guess or know in advance what names you are going to use and what they will mean, so you need a stylesheet if you want to display formatted text.
Browsers which read XML will probably accept and use a CSS stylesheet at a minimum, but you can also use the more powerful XSLT stylesheet language to transform your XML into HTML--which browsers, of course, already know how to display (and that HTML can still use a CSS stylesheet).
This transformation into HTML can be done either inside the browser, or by the server before the file is sent. Transformation in the browser offloads the processing from the server, but may introduce browser dependencies, leading to some of your readers being excluded. Transformation in the server makes the process browser-independent, but places a heavier processing load on the server.
As with any system where files can be viewed at random by arbitrary users, the author cannot know what resources (such as fonts) are on the user's system, so the same care is needed as with HTML using fonts. To invoke a stylesheet from an XML file, include one of the stylesheet declarations:
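A sketch of the two forms, using the W3C xml-stylesheet processing instruction. The file names are hypothetical, and text/xsl is the type browsers have conventionally accepted for XSLT rather than a registered media type. The declaration goes in the prolog, before the root element:

```xml
<!-- for a CSS stylesheet (hypothetical file name) -->
<?xml-stylesheet href="mystyle.css" type="text/css"?>
<!-- or, for an XSLT stylesheet (hypothetical file name) -->
<?xml-stylesheet href="mystyle.xsl" type="text/xsl"?>
```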
The Cascading Stylesheet Specification (CSS) provides a simple syntax for assigning styles to elements, and has been implemented in most browsers.
The Extensible Stylesheet Language (XSL) has been created for use specifically with XML. XSL uses XML syntax (an XSL stylesheet is an XML file) and has widespread support from several major vendors (see the questions on browsers and other software) although current browser support is limited. XSL comes in two flavors:
· XSL itself, which is a pure formatting language, and which needs a text formatter like FOP, PassiveTeX, or XEP to create printable output (in PDF). Currently I am not aware of any Web browsers that support direct XSL rendering;
· XSLT (T for Transformation) is a language to specify transformations of XML into HTML either inside the browser or at the server before transmission. It can also specify transformations from one vocabulary of XML to another, and from XML to plaintext (which can be any format, including RTF and LaTeX).
Currently only Microsoft IE 5.5+ and Mozilla 0.9.6+ handle XSLT inside the browser (MSIE5.5 needs some post-installation surgery to remove the obsolete WD-XSL and replace it with the current XSL-Transform processor; MSIE6 and Mozilla work as delivered). But there is a growing use of server-side processors like Cocoon and PropelX, which let you store your information in XML but serve it auto-converted to HTML, thus allowing the output to be used by any browser. XSLT is also widely used to transform XML into non-SGML formats for input to other systems (for example to transform XML into LaTeX for typesetting).
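A minimal XSLT 1.0 sketch of a transformation into HTML, assuming a source document whose root element is a List containing Item elements (hypothetical names, matching the List and Item element types used in the DTD question):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <!-- the root List element becomes an HTML bulleted list -->
  <xsl:template match="/List">
    <ul>
      <xsl:apply-templates select="Item"/>
    </ul>
  </xsl:template>
  <!-- each Item becomes one list entry carrying its text -->
  <xsl:template match="Item">
    <li><xsl:value-of select="."/></li>
  </xsl:template>
</xsl:stylesheet>
```

The same stylesheet can be applied in the browser or by a server-side processor; only the place where it runs differs.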
How do I use graphics in XML?
Graphics have traditionally just been links, which happen to have a picture file at the end rather than another piece of text. They can therefore be implemented in any way supported by the XLink and XPointer specifications, including using similar syntax to existing HTML images. They can also be referenced using XML's built-in ENTITY mechanism in a similar way to standard SGML, as external unparsed entities.
The linking specifications, however, give you much better control over the traversal and activation of links, so an author can specify, for example, whether or not to have an image appear when the page is loaded, or on a click from the user, or in a separate window, without having to resort to scripting.
XML itself doesn't predict or restrict graphic file formats: GIF, JPG, TIFF, PNG, CGM, and SVG at a minimum would seem to make sense; however, vector formats are normally preferred for non-photographic images.
Using entities for images
You cannot embed a raw binary graphics file (or any other binary [non-text] data) directly into an XML file because any bytes happening to resemble markup would get misinterpreted: you must refer to it by linking (see below). It would, however, in theory be possible to include a text-encoded transformation of a binary file as a CDATA Marked Section, using something like UUencode with markup characters such as > removed from the map so that they could not occur and be misinterpreted, or even simple hexadecimal encoding as used in PostScript. For vector graphics, however, the solution is to use SVG.
Bob DuCharme adds: All the data in an XML document entity must be parseable XML. You can define an external entity as either a parsed entity (parseable XML) or an unparsed entity (anything else). Unparsed entities can be used for picture files, sound files, movie files, or whatever you like. They can only be referenced from within a document as the value of an attribute (much like a bitmap picture on an HTML Web page is the value of the src attribute) and not part of the actual document. In an XML document, this attribute must be declared to be of type ENTITY, and the entity's declaration must specify a declared NOTATION, because if the entity isn't XML, the XML processor needs to know what it is. For example, in the following document, the colliepic entity is declared to have a JPEG notation, and it's used as the value of an attribute on the empty dog element:
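A sketch of such a document. The entity name colliepic, the JPEG notation, and the dog element come from the text; the attribute name picfile, the file name, and the notation's system identifier are assumptions:

```xml
<?xml version="1.0"?>
<!DOCTYPE dog [
  <!-- the notation tells the processor what kind of data the entity holds -->
  <!NOTATION JPEG SYSTEM "image/jpeg">
  <!-- unparsed external entity: the file name is hypothetical -->
  <!ENTITY colliepic SYSTEM "lassie.jpg" NDATA JPEG>
  <!ELEMENT dog EMPTY>
  <!-- picfile is a hypothetical attribute name, declared as type ENTITY -->
  <!ATTLIST dog picfile ENTITY #REQUIRED>
]>
<dog picfile="colliepic"/>
```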
The XLink and XPointer linking specifications describe other ways to point to a non-XML file such as a graphic. These offer more sophisticated control over the external entity's position, handling, and appearance within the XML document.
Scalable Vector Graphics (SVG)
Peter Murray-Rust writes: GIFs and JPEGs cater for bitmaps (pixel representations of images: all made up of colored dots). Vector graphics (scalable, made up of drawing specifications) are being addressed in the W3C's graphics activity as Scalable Vector Graphics. [With the specification now virtually complete,] it will be possible to transmit the graphical representation as vectors directly within the XML file. For many graphics objects this will mean greatly decreased download time and scaling without loss of detail.
Max Dunn writes: SVG has really taken off recently, and is quite an XML success story [...] there are already nearly conformant implementations.
XSLT can be used to generate SVG from XML; details are at http://www.siliconpublishing.org/svgfaq/XSLT.asp (be careful to use XSLT, not Microsoft's obsolete WD-xsl). Documents can also interact with SVG images (see http://www.xml.com/pub/a/2000/03/22/style/index.html).
What are these terms DTDless, valid, and well-formed?
XML lets you use a Document Type Definition (DTD) to describe the markup (elements and other constructs) available in any specific type of document. However, the design and construction of a DTD can be complex and non-trivial, so XML also lets you work without a DTD. DTDless operation means you can invent markup without having to define it formally, provided you stick to the rules of XML syntax.
To make this work, a DTDless file is assumed to define its own markup by the existence and location of elements where you create them. When an XML application encounters a DTDless file, it builds its internal model of the document structure while it reads it, because it has no DTD to tell it what to expect. There must therefore be no surprises or ambiguous syntax: the document must be `well-formed' (must follow the rules).
To understand why this concept is needed, look at standard HTML as an example:
· An element such as IMG, which is defined (in the SGML DTDs for HTML) as EMPTY, doesn't have an end-tag (there is no such thing as </IMG>); and many other HTML elements (such as P) allow you to omit the end-tag for brevity.
· If an XML processor reads an HTML file without knowing this (because it isn't using a DTD), and it encounters <IMG> or <P> or many other start-tags, it would have no way to know whether or not to expect an end-tag, which makes it impossible to know if the rest of the file is correct or not, because it has now lost track of whether it is inside an element or if it has finished with it.
Well-formed documents therefore require start-tags and end-tags on every normal element, and any EMPTY elements must be made unambiguous, either by using normal start-tags and end-tags, or by affixing a slash to the start-tag before the closing > as a sign that there will be no end-tag.
All XML documents, both DTDless and valid, must be well formed. They must start with an XML Declaration if necessary (for example, identifying the character encoding or using the Standalone Document Declaration):
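A sketch of a well-formed DTDless document beginning with such a declaration. The element names are placeholders, and the encoding shown is only one possible choice:

```xml
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<foo>
  <!-- blort is an empty element, closed with a trailing slash -->
  <bar>...<blort/>...</bar>
</foo>
```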
David Brownell notes: XML that's just well formed doesn't need to use a Standalone Document Declaration at all. Such declarations are there to permit certain speedups when processing documents while ignoring external parameter entities; basically, you can't rely on external declarations in standalone documents. The types that are relevant are entities and attributes. Standalone documents must not require any kind of attribute value normalization or defaulting, otherwise they are invalid.
Rules for well-formedness:
· All tags must be balanced: that is, every element which may contain character data or sub-elements must have both the start-tag and the end-tag present (omission is not allowed except for empty elements, see below);
· All attribute values must be in quotes. The single-quote character (the apostrophe) may be used if the value contains a double-quote character, and vice versa. If you need isolated quotes as data as well, you can use &apos; and &quot;. Do not under any circumstances use the automated typographic (`curly') inverted commas substituted by some wordprocessors for quoting attribute values.
· EMPTY elements (e.g. those with no end-tag, like HTML's IMG, HR, and BR and others) must either end with /> or they must look like non-EMPTY elements by having a real end-tag (but no content). Example: <BR> would become either <BR/> or <BR></BR> (with nothing in between).
· There must not be any isolated markup-start characters (< or &) in your text data. They must be given as &lt; and &amp; respectively, and the sequence ]]> may only occur as the end of a CDATA marked section: if you are using it for any other purpose it must be given as ]]&gt;.
· Elements must nest inside each other properly (no overlapping markup, same as for HTML);
· DTDless well-formed documents may use attributes on any element, but the attributes are all assumed to be of type CDATA. You cannot use ID/IDREF attribute types for parser-checked cross-referencing in DTDless documents.
· XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use. With a DTD, all other character entities must be declared, and for interoperability these five should be declared as well. If you need other character entities in a DTDless file, you can declare them in an internal subset without referencing anything other than the root element type:
Hindsight—a wonderful thing.
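The sentence above could be marked up in a DTDless file as follows, declaring the dash as an entity in the internal subset. The entity name mdash and the root element name example are assumptions:

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE example [
  <!-- mdash is declared locally; &#x2014; is the em rule character -->
  <!ENTITY mdash "&#x2014;">
]>
<example>Hindsight&mdash;a wonderful thing.</example>
```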
A valid file begins with a Document Type Declaration, but may have an optional XML Declaration prepended:
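A sketch of the start of such a file. The root element advert, its children, and the DTD URL are all hypothetical, and whether the content is valid depends on what that DTD declares:

```xml
<?xml version="1.0"?>
<!-- the SYSTEM identifier is a (hypothetical) URL for the DTD -->
<!DOCTYPE advert SYSTEM "http://www.example.org/dtds/ad.dtd">
<advert>
  <headline>...</headline>
  <text>...</text>
</advert>
```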
The XML Specification predefines an SGML Declaration for XML which is fixed for all instances and is therefore hard-coded into most XML software (the declaration has been removed from the text of the Specification and is now in a separate document). The specified DTD must be accessible to the XML processor using the URL supplied in the SYSTEM Identifier, either by being available locally (ie the user already has a copy on disk), or by being retrievable via the network.
It is possible (many people would say preferable) to supply a Formal Public Identifier with the PUBLIC keyword, and use an XML Catalog to dereference it, but the Specification mandates a SYSTEM Identifier so this must still be supplied (after the PUBLIC identifier: no further keyword is needed):
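A sketch of the PUBLIC form, with a hypothetical Formal Public Identifier followed directly by a hypothetical system identifier:

```xml
<!-- public identifier first, then the mandatory system identifier -->
<!DOCTYPE advert PUBLIC "-//Example Org//DTD Advertisements//EN"
  "http://www.example.org/dtds/ad.dtd">
```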
The test for validity is that a validating parser finds no errors in the file: it must conform absolutely to the definitions and declarations in the DTD.
Which should I use in my DTD, attributes or elements?
There is no single answer to this: a lot depends on what you are designing the document type for.
Traditional editorial practice is to put the real text (what would be printed) as character data content, and keep the metadata (information about the text) in attributes, from where they can more easily be isolated for analysis or special treatment like display in the margin or in a mouseover:
Portia The quality of mercy is not strain'd,
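Marked up along those lines, the line above might look like this. The element and attribute names, and the line number, are assumptions for illustration:

```xml
<!-- printed text is content; the line number is metadata in an attribute -->
<line n="184"><speaker>Portia</speaker> The quality of mercy is not strain'd,</line>
```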
But from the systems point of view, there is nothing wrong with storing the data the other way round, especially where the volume of text data on each occasion is relatively small:
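A sketch of the attribute-only alternative, using a line of verse as the data. The element and attribute names and the line number are hypothetical:

```xml
<!-- everything is held in attributes; the element itself is empty -->
<line speaker="Portia" n="184" text="The quality of mercy is not strain'd,"/>
```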
A lot will depend on what you want to do with the information and which bits of it are easiest accessed by each method. A rule of thumb for conventional text documents is that if the markup were all stripped away, the bare text should still be readable and usable, even if unformatted and inconvenient. For database output, however, or other machine-generated documents like e-commerce transactions, human reading may not be meaningful, so it is perfectly possible to have documents where all the data is in attributes, and the document contains no character data in content models at all.
What's a namespace?
Randall Fowle writes: A namespace is a collection of element and attribute names identified by a Uniform Resource Identifier reference. The reference may appear in the root element as a value of the xmlns attribute. For example, the namespace reference for an XML document with a root element x might appear like this:
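A sketch of such a reference; the URI is hypothetical and, as with any namespace name, need not point at a real file:

```xml
<!-- default namespace: applies to x and its unprefixed descendants -->
<x xmlns="http://www.example.org/ns/company-schema">
  ...
</x>
```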
More than one namespace may appear in a single XML document, to allow a name to be used more than once. Each reference can declare a prefix to be used by each name, so the previous example might instead carry an xmlns:spc attribute, which would nominate the namespace for the `spc' prefix:
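A sketch of the prefixed form; the URI and the item element are hypothetical:

```xml
<!-- only names carrying the spc: prefix belong to this namespace -->
<x xmlns:spc="http://www.example.org/ns/company-schema">
  <spc:item>...</spc:item>
</x>
```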
James Anderson adds: in general, note that the binding may also be effected by a default value for an attribute in the DTD.
The reference does not need to be a physical file; it is simply a way to distinguish between namespaces. The reference should tell a person looking at the XML document where to find definitions of the element and attribute names using that particular namespace.
How do I include one DTD (or fragment) in another?
This works exactly the same as for SGML. First you declare the entity you want to include, and then you reference it by name as a parameter:
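A sketch of the two steps inside the including DTD; the entity name mylists and the URL are hypothetical:

```xml
<!-- declare the parameter entity, pointing at the DTD fragment -->
<!ENTITY % mylists SYSTEM "http://www.example.org/dtds/lists.dtd">
<!-- reference it where the fragment's declarations are wanted -->
%mylists;
```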
Such declarations traditionally go all together towards the top of the main DTD file, where they can be managed and maintained, but this is not essential so long as they are declared before they are used. You use Parameter Entity Syntax for this (the percent sign) because the file is to be included at DTD compile time, not when the document instance itself is parsed.
Note that a URL is compulsory in XML as the System Identifier for all external file references: standard rules for dereferencing URLs apply (assume the same method, server, and directory as the containing document). A Formal Public Identifier can also be used, following the same rules as elsewhere.