LearningXML

From HerzbubeWiki

On this page I keep notes about XML, the Extensible Markup Language.

Because I already know quite a few things about XML, the notes on this page are rather spotty - instead of representing a "learning from the ground up" experience, they focus on those things that I do not yet know, or that I find noteworthy enough to jot down here.

I have started this page because I plan to get an education in web programming. In the context of this education programme, the learning page that precedes this one is LearningJavaScript.


References


Glossary

Attribute minimization
An SGML practice that allows certain attributes to be reduced to just the attribute value. Example: The attribute "checked" can be written without an explicit value assignment. Attribute minimization is not allowed in XML and therefore not in XHTML, where you have to write checked="checked".
Data facet
A restriction on data defined in an XML schema.
Prolog
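The part of an XML document that precedes the root element. It consists of the optional XML declaration, an optional document type declaration, and any comments or processing instructions. Informally, "prolog" is often used as another name for the XML declaration alone.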
XML
Extensible Markup Language. There are two major versions of the standard: XML 1.0 and XML 1.1.
XML 1.0
Version 1 of the XML language. The latest revision is the fifth edition from November 2008. XML 1.0 is widely used and has not yet been deprecated - actually it is the version of XML that is officially recommended for use if one does not need the special features from XML 1.1.
XML 1.1
Version 1.1 of the XML language. The latest revision is the second edition from August 2006. XML 1.1 is not very widely used, and although it technically supersedes XML 1.0 it is not recommended for use unless one needs its special features. See the Wikipedia page (link is in the "References" section) for an overview of the differences between the two versions.
XPath
XML Path Language. A syntax for addressing the components (elements, attributes, etc.) of an XML document.
XSD
XML Schema Definition.
XSL
Extensible Stylesheet Language.


Well-formed vs. valid

A well-formed XML document is a document that conforms to a set of universal syntax rules of the language. The following rules cover a lot of ground:

  • Element names must start with a letter or underscore. They can contain letters, digits, hyphens, underscores, and periods. They cannot contain spaces. Names that start with "xml" (in any combination of upper/lowercase) are reserved and should not be used.
  • Every opening tag must have a matching closing tag, or the element must be written as an empty-element tag such as <foo/>. Note that XML is case sensitive, so <Foo> cannot be closed by </foo>.
  • Tags must be properly nested.
  • Attribute values must be enclosed in quotes, either single or double quotes.
  • An XML document must contain exactly one root element that is the ancestor of all the other elements.
  • If an XML declaration exists, it must appear before any of the other content of the document.
  • The "less-than" character (<) and the ampersand character (&) must be written as entity references (&lt; and &amp;, respectively) when they appear in text content or attribute values, because in their literal form they have special significance to an XML parser.
  • Line endings do not need to follow a particular convention: an XML parser normalizes CR LF sequences and stand-alone CR characters to a single LF (line feed) character (ASCII code value 10) before passing the content on.
  • A comment must not contain two dashes (--) in the middle.
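
To make the rules a bit more tangible, here is a small document that I believe satisfies all of them. It is only a sketch: the element and attribute names ("library", "book", etc.) are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Exactly one root element; "library" and all other names are made up. -->
<library>
  <!-- Attribute values are quoted, with single or double quotes. -->
  <book id="b1" lang='en'>
    <!-- "<" and "&" are written as entity references in text. -->
    <title>Rules for a &lt; b &amp; c</title>
    <!-- An empty element can be closed in its own tag. -->
    <out_of_print/>
  </book>
  <!-- Names are case sensitive: "Book" is a different element than "book". -->
  <Book id="b2"/>
</library>
```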


An XML document is valid when it is well-formed and its structure and content conform to the rules specified by a so-called "schema". Because XML has no predefined elements, an XML document's validity cannot be determined unless a schema exists which says which elements can/must exist in an XML document, what kind of data can/must appear in each of the elements, etc.


XML declaration (Prolog)

The XML declaration, sometimes also called "the prolog", is the snippet that appears at the top of an XML document. A typical XML declaration looks like this:

<?xml version="1.0" encoding="UTF-8"?>

Notes:

  • XML 1.1 requires that an XML declaration is present, but in XML 1.0 it is optional.
  • If an XML declaration is present, it must be the very first thing in the document: nothing, not even whitespace or a comment, may precede it.
  • The declaration can have a "standalone" directive, which apparently is only relevant if the XML document contains a reference to a DTD. For more details see this StackOverflow question.
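
For illustration, a declaration that uses all three parts could look like the sketch below. The order of the parts is fixed: version first, then encoding, then standalone. The root element "note" is just a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- "note" is a placeholder root element -->
<note>...</note>
```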


Namespaces

In this example, there are two distinct elements "foo": The first is coming from namespace "ns1", the second is coming from namespace "ns2". Although both elements have the same name, they can have totally different meanings.

<ns1:foo>...</ns1:foo>
<ns2:foo>...</ns2:foo>

In the example above, "ns1" and "ns2" are not actually namespaces - they are prefixes. Prefixes are used as shortcuts to the actual namespaces. Prefixes and namespaces must be declared. A namespace can be defined by an xmlns attribute in the opening tag of an element. A namespace declaration has the following syntax:

xmlns:prefix="URI"

In this example, the prefix "h" is a shortcut to the namespace http://www.w3.org/TR/html4/:

<h:table xmlns:h="http://www.w3.org/TR/html4/"><h:tr>...</h:tr></h:table>

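If the prefix is omitted from a namespace declaration, the declaration defines a default namespace: the element itself and all its unprefixed child elements are assigned to that namespace. Attributes without a prefix are not affected by the default namespace.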

<table xmlns="http://www.w3.org/TR/html4/"><tr>...</tr></table>

Some additional notes about namespaces:

  • When a namespace is defined for an element, all child elements with the same prefix are associated with the same namespace.
  • Namespaces can also be declared in the XML root element.
  • An element can have several namespace declarations, but each declaration must use a different prefix.
  • The namespace URI is not used by the XML parser to look up information! The purpose of using a URI is to give the namespace a unique name. However, companies often use the namespace as a pointer to a human-readable web page containing namespace information.
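
The following sketch combines the notes above in a single document: a default namespace and a prefix declared on the root element, plus a declaration on a nested element. The example.com URIs and the element names are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The example.com URIs and all element names are invented for this sketch. -->
<order xmlns="https://example.com/shop"
       xmlns:f="https://example.com/furniture">
  <!-- No prefix: "item" is in the default namespace https://example.com/shop -->
  <item>
    <!-- Prefix "f": this "table" is in https://example.com/furniture -->
    <f:table>
      <f:legs>4</f:legs>
    </f:table>
  </item>
  <!-- A declaration on a nested element applies to that element and its descendants only. -->
  <h:table xmlns:h="http://www.w3.org/TR/html4/">
    <h:tr>...</h:tr>
  </h:table>
</order>
```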


The Schema

Basic schema document

A schema document typically looks something like this:

<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="https://foo.com"
    elementFormDefault="qualified"
    xmlns="https://foo.com">
  [...]
</xs:schema>

Discussion:

  • The namespace http://www.w3.org/2001/XMLSchema is a standardized namespace for XML schemas. The "xs" prefix is a convention used for this namespace. It probably is the acronym for "XML Schema". Sometimes the "xsd" prefix is used instead of "xs".
  • The root element of an XSD is "schema".
  • The root element and all other elements that together form the schema declaration come from the namespace referenced by the "xs" prefix.
  • The "targetNamespace" attribute indicates which namespace the elements and attributes defined by the schema belong to. Note: This is optional, a schema does not need to have a target namespace.
  • The "elementFormDefault" attribute defines the default value of the "form" attribute that can optionally be specified on every local element declaration in the schema. The "form" attribute tells an XML validator which namespace the element must have in the XML instance document: "qualified" means that the element must be in the target namespace of the schema, "unqualified" means that the element must not be in any namespace. Globally declared elements are not affected, they always belong to the target namespace. The default value of "elementFormDefault" is "unqualified", but it is almost always a good idea to specify "qualified". See this StackOverflow question for details.
  • In this example (shamelessly copied from W3Schools) the default namespace is set to the target namespace. This allows us to write references to elements from the target namespace without a prefix. A different, equally valid approach would be to set the default namespace to the XMLSchema namespace - in that case we would be able to write all declarations without the "xs" or "xsd" prefix. This document (PDF) has a good overview of the pros/cons of each approach.
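
As a sketch of the alternative approach mentioned in the last point, the same kind of schema could be written with the XMLSchema namespace as the default namespace. The prefix "t" and the component names "carType" and "car" are my own placeholders.

```xml
<!-- Default namespace = XMLSchema namespace, so the schema vocabulary needs no prefix.
     The prefix "t" and the names "carType" and "car" are placeholders. -->
<schema
    xmlns="http://www.w3.org/2001/XMLSchema"
    xmlns:t="https://foo.com"
    targetNamespace="https://foo.com"
    elementFormDefault="qualified">
  <!-- Built-in types such as "string" are now written without a prefix ... -->
  <simpleType name="carType">
    <restriction base="string"/>
  </simpleType>
  <!-- ... but references to our own components need the "t" prefix. -->
  <element name="car" type="t:carType"/>
</schema>
```

The trade-off is visible in the last declaration: the schema vocabulary becomes shorter, but every reference to a type or element from the target namespace now needs a prefix.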


XML document referring to a schema

An XML document referring to a schema typically looks like this:

<rootElement
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation=
      "https://foo.com
       bar.xsd"
    xmlns="https://foo.com">
  [...]
</rootElement>

Discussion:

  • The namespace http://www.w3.org/2001/XMLSchema-instance is a standardized namespace. The "xsi" prefix is a convention used for this namespace. It probably is the acronym for "XML Schema Instance".
  • The "schemaLocation" attribute defines 1-n value pairs. Values are separated by whitespace. The two values in a pair have the following meaning:
    • The first value is the target namespace of the schema
    • The second value is the actual location of the schema document. It gives a validator a hint where it can fetch the schema from.
  • The default namespace allows us to write shorter, more concise document code.
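
Since "schemaLocation" can hold more than one pair, a document that uses elements from two schemas might look like the following sketch. The second namespace (https://baz.com), its prefix "b", the file baz.xsd and the element "extra" are invented for illustration.

```xml
<!-- Sketch with two namespace/location pairs. The second namespace
     (https://baz.com), its prefix "b" and the file baz.xsd are invented. -->
<rootElement
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation=
      "https://foo.com bar.xsd
       https://baz.com baz.xsd"
    xmlns="https://foo.com"
    xmlns:b="https://baz.com">
  <b:extra>...</b:extra>
</rootElement>
```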


If the document wants to reference a schema without a target namespace, then it must do so via the "noNamespaceSchemaLocation" attribute. The example above can be written like this:

<rootElement
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="bar.xsd">
  [...]
</rootElement>

Discussion:

  • The "noNamespaceSchemaLocation" attribute can have only a single value.
  • Since the schema has no target namespace, we also don't need to define a default namespace.


Simple vs. complex elements

A simple element is an element that

  • Cannot have any child elements
  • Cannot have attributes
  • Can only have a text value, but the value can be restricted to a specific data type such as "xs:integer"

Everything else is a complex element.


This is the definition of a simple element:

<xs:element name="foo" type="bar"/>


XML Schema has a lot of built-in data types. Some of the most common data types are:

  • xs:string
  • xs:decimal
  • xs:integer
  • xs:boolean
  • xs:date
  • xs:time


Simple elements can be defined with either a default value, or a fixed value, but not both:

<xs:element name="foo" type="bar" default="42"/>
<xs:element name="foo" type="bar" fixed="42"/>
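
To see what the two variants mean for an instance document, here is a small sketch with concrete types. The element names and values are placeholders of mine.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Names and values are placeholders. -->

  <!-- An empty <color/> in the instance is treated like <color>red</color>. -->
  <xs:element name="color" type="xs:string" default="red"/>

  <!-- <version>1.0</version> and an empty <version/> are accepted,
       <version>2.0</version> is rejected. -->
  <xs:element name="version" type="xs:string" fixed="1.0"/>
</xs:schema>
```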


Attributes

Attributes are always declared as simple types, similar to simple elements. The syntax is this:

<xs:attribute name="foo" type="bar" use="required"/>

In the example, the attribute is required. An attribute declaration without the "use" attribute defaults to the attribute being optional.


Attributes can be defined with either a default value, or a fixed value, but not both:

<xs:attribute name="foo" type="bar" default="42"/>
<xs:attribute name="foo" type="bar" fixed="42"/>
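
An attribute declaration usually lives inside the declaration of the element that carries the attribute. The following sketch shows a required attribute and an optional attribute with a default value; "note", "id" and "lang" are invented names.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- "note", "id" and "lang" are invented names. -->
  <xs:element name="note">
    <xs:complexType>
      <!-- Must be present in every <note>. -->
      <xs:attribute name="id" type="xs:integer" use="required"/>
      <!-- Optional; if it is missing, a validator assumes lang="en". -->
      <xs:attribute name="lang" type="xs:string" default="en"/>
    </xs:complexType>
  </xs:element>
  <!-- Valid instances: <note id="1"/> and <note id="2" lang="de"/>.
       Invalid instance: <note lang="de"/> (the required "id" is missing). -->
</xs:schema>
```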


Restrictions on values

The values that elements and attributes can have can be restricted.

This example defines a value range:

<xs:element name="age">
  <xs:simpleType>
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="120"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

This example defines a length restriction. Note that minimum and maximum length are both optional.

<xs:element name="password">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:minLength value="5"/>
      <xs:maxLength value="8"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

This example defines a set of legal values:

<xs:element name="car">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="Audi"/>
      <xs:enumeration value="Golf"/>
      <xs:enumeration value="BMW"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

This example does the same thing as the previous example, but in addition defines a new data type that can also be used by elements other than the "car" element.

<xs:element name="car" type="carType"/>

<xs:simpleType name="carType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Audi"/>
    <xs:enumeration value="Golf"/>
    <xs:enumeration value="BMW"/>
  </xs:restriction>
</xs:simpleType>

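This example defines a regex pattern of legal values. Mostly the usual regex rules apply, with the notable difference that a pattern is implicitly anchored: it must always match the entire value, so the "^" and "$" anchors are not used (e.g. the pattern "male|female" matches exactly these two values).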

<xs:element name="initials">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z][A-Z][A-Z]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Whitespace handling can be specified in one of three forms. Note that the following example is not valid as written: it lists all three forms for illustration, but only one of them can appear in a restriction.

<xs:element name="address">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:whiteSpace value="preserve"/>   <!-- Preserve whitespace as-is -->
      <xs:whiteSpace value="replace"/>    <!-- Replace whitespace with space characters -->
      <xs:whiteSpace value="collapse"/>   <!-- Replace whitespace with space characters, trim the result, reduce multiple spaces to a single space -->
    </xs:restriction>
  </xs:simpleType>
</xs:element>
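
For comparison, a valid variant of the declaration picks exactly one of the three forms. The sketch below uses "collapse"; the sample address in the comment is made up.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="address">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <!-- Exactly one whiteSpace facet. With "collapse" the instance value
             "  12   Main   Street " is seen by the validator as "12 Main Street". -->
        <xs:whiteSpace value="collapse"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
```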


A few additional constraints:

fractionDigits
Specifies the maximum number of decimal places allowed. Must be equal to or greater than zero
maxExclusive
Specifies the upper bounds for numeric values (the value must be less than this value)
maxInclusive
Specifies the upper bounds for numeric values (the value must be less than or equal to this value)
minExclusive
Specifies the lower bounds for numeric values (the value must be greater than this value)
minInclusive
Specifies the lower bounds for numeric values (the value must be greater than or equal to this value)
totalDigits
Specifies the maximum total number of digits allowed. Must be greater than zero
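
A sketch of how some of these constraints work together on a decimal value. The element name "price" and the limits are made up for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- "price" and the limits are made up for this sketch. -->
  <xs:element name="price">
    <xs:simpleType>
      <xs:restriction base="xs:decimal">
        <xs:minExclusive value="0"/>      <!-- must be greater than 0 -->
        <xs:totalDigits value="5"/>       <!-- at most 5 digits in total -->
        <xs:fractionDigits value="2"/>    <!-- at most 2 of them after the decimal point -->
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <!-- Valid:   <price>123.45</price>, <price>7.5</price>
       Invalid: <price>0</price> (not greater than 0),
                <price>1.234</price> (3 fraction digits),
                <price>123456</price> (6 digits) -->
</xs:schema>
```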


Complex elements

A complex element can be defined in two ways:

  • By declaring a type. The type can then be reused by many different complex elements.
  • By declaring only that one complex element. The declaration in this case is unique to that single complex element and cannot be re-used elsewhere.


Example for a reusable type declaration:

<xs:element name="employee" type="personinfo"/>
<xs:element name="student" type="personinfo"/>
<xs:element name="member" type="personinfo"/>

<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:sequence>
  <xs:attribute name="employeeId" type="xs:positiveInteger"/>
</xs:complexType>

Example for a non-reusable declaration:

<xs:element name="employee">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="employeeId" type="xs:positiveInteger"/>
  </xs:complexType>
</xs:element>

Discussion:

  • In the examples we specified both child elements and an attribute. Both are optional, so we could define only child elements or only attributes.
  • If both child elements and attributes are specified, the attributes must be specified after the child elements.
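
An instance that should be valid against either of the two "employee" declarations above could look like this. The sketch assumes a schema without target namespace; the values are made up.

```xml
<!-- Assumes a schema without target namespace; the values are made up. -->
<employee employeeId="4711">
  <firstname>Jane</firstname>
  <lastname>Doe</lastname>
</employee>
```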


It is possible to extend an existing reusable type:

<xs:element name="employee" type="fullpersoninfo"/>

<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="fullpersoninfo">
  <xs:complexContent>
    <xs:extension base="personinfo">
      <xs:sequence>
        <xs:element name="address" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="country" type="xs:string"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
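
In an instance document, the elements of the base type come first, followed by the elements added by the extension. A sketch with made-up values, again assuming a schema without target namespace:

```xml
<!-- Assumes a schema without target namespace; the values are made up. -->
<employee>
  <!-- Inherited from "personinfo" -->
  <firstname>Jane</firstname>
  <lastname>Doe</lastname>
  <!-- Added by "fullpersoninfo" -->
  <address>1 Example Street</address>
  <city>Springfield</city>
  <country>Switzerland</country>
</employee>
```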


Order indicators

A sequence is used to define an ordered sequence of elements. By default every element must appear exactly once.

<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:sequence>
</xs:complexType>


If we don't care about the order of elements, we can use the "all" indicator. Child elements can appear in any order. By default every element must appear exactly once.

<xs:complexType name="personinfo">
  <xs:all>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:all>
</xs:complexType>


To define alternatives we use the "choice" indicator. Exactly one of the child elements can appear. By default whichever element appears must appear exactly once.

<xs:complexType name="personinfo">
  <xs:choice>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:choice>
</xs:complexType>
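
The difference between the three indicators is easiest to see in instance documents. In the following sketch "person" is a hypothetical element of type "personinfo", and "examples" is only a wrapper that lets me show several instances in one listing.

```xml
<!-- "person" is a hypothetical element of type "personinfo";
     "examples" only serves as a wrapper to show several instances at once. -->
<examples>
  <!-- sequence: only this order is valid -->
  <person><firstname>Jane</firstname><lastname>Doe</lastname></person>
  <!-- all: the order above is valid, and so is this one -->
  <person><lastname>Doe</lastname><firstname>Jane</firstname></person>
  <!-- choice: exactly one of the two children -->
  <person><lastname>Doe</lastname></person>
</examples>
```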


Occurrence indicators

Occurrence indicators specify how many times an element can occur.

  • The indicators are the attributes "minOccurs" and "maxOccurs".
  • Their default value is 1, i.e. by default an element must appear exactly once.
  • To allow an unlimited number of occurrences, set "maxOccurs" to the value "unbounded".
  • In conjunction with the "all" indicator, "minOccurs" can only be set to 0 or 1, and "maxOccurs" can only be set to 1.


Example:

<xs:complexType name="personinfo">
  <xs:choice>
    <xs:element name="firstname" type="xs:string" minOccurs="1" maxOccurs="10"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:choice>
</xs:complexType>
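
My reading of this example, spelled out as a sketch: the type is attached to a hypothetical element "person", and the comment at the end lists instances that I would expect a validator to accept or reject.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- "person" is a hypothetical element that uses the type from the example. -->
  <xs:element name="person" type="personinfo"/>

  <xs:complexType name="personinfo">
    <xs:choice>
      <xs:element name="firstname" type="xs:string" minOccurs="1" maxOccurs="10"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
  <!-- Valid:   <person><firstname>Anna</firstname><firstname>Maria</firstname></person>
       Valid:   <person><lastname>Doe</lastname></person>
       Invalid: <person><firstname>Anna</firstname><lastname>Doe</lastname></person>
                (both branches of the choice are used)
       Invalid: <person/> (the choice itself must occur once) -->
</xs:schema>
```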


Grouping indicators

Grouping indicators are used to declare reusable sets of elements or attributes.

Example of an element group and its use:

<xs:group name="persongroup">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
    <xs:element name="birthday" type="xs:date"/>
  </xs:sequence>
</xs:group>

<xs:element name="person" type="personinfo"/>

<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:group ref="persongroup"/>
    <xs:element name="country" type="xs:string"/>
  </xs:sequence>
</xs:complexType>
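
A matching instance contains the elements of the group in place of the group reference, followed by "country". The sketch assumes a schema without target namespace; the values are made up.

```xml
<!-- Assumes a schema without target namespace; the values are made up. -->
<person>
  <!-- These three come from "persongroup" -->
  <firstname>Jane</firstname>
  <lastname>Doe</lastname>
  <birthday>1990-04-01</birthday>
  <!-- Declared directly in "personinfo" -->
  <country>Switzerland</country>
</person>
```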


Example of an attribute group and its use:

<xs:attributeGroup name="personattrgroup">
  <xs:attribute name="firstname" type="xs:string"/>
  <xs:attribute name="lastname" type="xs:string"/>
  <xs:attribute name="birthday" type="xs:date"/>
</xs:attributeGroup>

<xs:element name="person">
  <xs:complexType>
    <xs:attributeGroup ref="personattrgroup"/>
  </xs:complexType>
</xs:element>


References

The "ref" attribute can be used to create a reference to something else. In the previous section we have seen an application of this for element and attribute groups. A few additional examples:

<xs:element name="foo">
    <xs:complexType>
        <xs:sequence>
            <xs:element ref="barElement"/>
        </xs:sequence>
        <xs:attribute ref="barAttribute" use="required"/>
    </xs:complexType>
</xs:element>

<xs:element type="xs:string" name="barElement"/>
<xs:attribute name="barAttribute"/>

Discussion:

  • The "use" attribute can only be specified at the place where "barAttribute" is referenced, not on the global declaration of "barAttribute".
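
An instance for the declarations above might look like this sketch. It assumes a schema without target namespace, and the values are made up.

```xml
<!-- Assumes a schema without target namespace; the values are made up. -->
<foo barAttribute="some value">
  <barElement>some text</barElement>
</foo>
```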


Unions

A union defines a simple type whose values must be valid against at least one of several member simple types:

<xs:element name="jeans_size">
  <xs:simpleType>
    <xs:union memberTypes="sizebyno sizebystring" />
  </xs:simpleType>
</xs:element>

<xs:simpleType name="sizebyno">
  <xs:restriction base="xs:positiveInteger">
    <xs:maxInclusive value="42"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="sizebystring">
  <xs:restriction base="xs:string">
    <xs:enumeration value="small"/>
    <xs:enumeration value="medium"/>
    <xs:enumeration value="large"/>
  </xs:restriction>
</xs:simpleType> 
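
A few instance values for the union above, as a sketch. The "examples" element is only a wrapper of mine that makes it possible to show several instances in one listing.

```xml
<!-- "examples" is just a wrapper to show several instances at once. -->
<examples>
  <jeans_size>38</jeans_size>      <!-- valid: matches "sizebyno" -->
  <jeans_size>medium</jeans_size>  <!-- valid: matches "sizebystring" -->
  <jeans_size>43</jeans_size>      <!-- invalid: above 42 and not one of the strings -->
  <jeans_size>huge</jeans_size>    <!-- invalid: neither a number nor an allowed string -->
</examples>
```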


TODO

  • Elements that only contain text
  • Elements that contain both text and child elements