Inside XML


Staying up-to-date on the latest tech talk is difficult -- and keeping your business humming with the most-recent technology can seem nearly impossible. You must be able to publish in print and on the Web, and pretty soon you'll also need the tools to deliver information to cell phones, Internet appliances, e-books, and countless other gadgets.

The new wave of computer gizmos poses a significant challenge to creative professionals. How do you keep current and compatible with the ever-changing world of technology? Although the answer isn't exactly simple, it is easy to learn, easy to understand, and easy to use. Welcome to XML.

Few people know what XML is, and chances are, even fewer know that it stands for Extensible Markup Language. But don't let what you don't know frighten you. XML promises to make Web publishing as simple as an elementary-school grammar lesson. And Macworld's in-depth XML tutorial will show you what it's all about.

To understand XML, you must go back to its roots and look at the parent languages -- HTML and Standard Generalized Markup Language (SGML) -- that spawned this new dialect. HTML possesses several attributes that make it the perfect enabler for creating an easily accessible, global network. It is a noncentralized way of putting information into a file and guaranteeing readability across a wide variety of networks. The content is marked up with a series of tags that designate what kind of information is being read. For example, an article's headline and contents would be marked up like this:

<h1>Headline: All headlines in standard HTML are enclosed in tags like this.</h1>

<p>Paragraph: Individual paragraphs are enclosed in tags that mark each paragraph as a discrete chunk of information.</p>

HTML owes much of its success to SGML, which was founded on a generic coding concept: devise a flexible, precise, and descriptive vocabulary for expressing the contents of electronic documents. This vocabulary defined a document's structure and organization, thus making it easily readable across several different types of applications -- provided those applications could read SGML (which is still widely used by people with very complex information to organize, such as librarians or editors creating large technical documents).

SGML flourished because it gave documents unprecedented portability. The language's downside was its complexity, and HTML was created in part to provide a quick and easy tool for accomplishing portability across networks.

However, HTML lacks two characteristics that Web developers and users demand: easily indexable data structures and customizable appearances. Because HTML provided only a very basic document structure, people who wanted to mark up data in a way that would reflect the underlying organization of their content were out of luck. Site creators who wanted to control the appearance of content were even unluckier.

Early in the Web-development game, HTML's document-structure tags such as <p> and <body> got tied up with document-appearance attributes such as <font color>. As a result, many Web sites became elaborate nests of tables, font tags, and images. Web designers often succeeded in specifying how a site should look, but they did so at the expense of a document's structure. The end result was a lot of work for the people charged with maintaining the site.

For example, a Web-site designer might put all headlines in a big, bright-red Arial font. Any reader who sees this using the correct browser may deduce that any instance of big, red text is a headline. But from a markup-language grammar perspective, the document is disorganized. There is no way to tell which tags are meant to identify the headline. XML solves that problem by providing a data-encoding method that can be easily defined and exchanged across several different computers.
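The difference is easiest to see side by side. The fragment below is only a sketch: the headline text and the <headline> element name are made up for illustration and do not come from any real markup language.

```xml
<!-- Presentational markup: only the appearance hints that this is a headline -->
<font face="Arial" size="6" color="red">Local Team Wins Relay</font>

<!-- Structural markup: the tag names the role, and a stylesheet supplies the look -->
<headline>Local Team Wins Relay</headline>
```

A program reading the second version can find every headline by its tag name, without guessing from fonts and colors.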

HTML falls short in two critical areas: organizing information into categories or hierarchies is difficult, and HTML doesn't let you easily control the appearance of items on your page. When it comes to organizing information, HTML coders rely on a series of headings -- headline sizes that can denote a hierarchy: <h1> through <h6>.

The headings work well if you're organizing a document according to a strict outline, but not all types of information fit neatly into this model. For example, if you wanted to mark up a document about swimming, you might have the following groups of information: types of swimming strokes, distance of different races, and composition of races by strokes.

In HTML, there's no neat way to indicate that these different groups of information are related but not nested within each other. In XML, however, you can write a Swimming Markup Language and set up elements such as:

  <stroke> </stroke>

  <race> </race>

You can also set up subcategories within each to indicate different kinds of strokes and races, for example:

  <stroke>

    <butterfly> </butterfly>

    <breast> </breast>

    <back> </back>

    <freestyle> </freestyle>

  </stroke>


  <race>

    <distance length="100"> </distance>

    <distance length="200"> </distance>

  </race>

Best of all, you can then group the different elements together to create more-complex data:

  <race type="IM">

    <stroke>

      <butterfly> </butterfly>

      <breast> </breast>

      <back> </back>

      <freestyle> </freestyle>

    </stroke>

  </race>

This would indicate all the different strokes that make up the IM race. Organizing data like this in HTML would be very difficult; there would be no way for you to draw distinctions between the information.
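Put together as a complete file, the swimming markup might look like the sketch below. The <swimming> root element and the type attribute are assumptions added for illustration: an XML document needs exactly one outermost element, and a label such as "IM" has to be carried by a named attribute.

```xml
<?xml version="1.0"?>
<!-- Hypothetical root element: a well-formed document has exactly one -->
<swimming>
  <!-- The individual medley is made up of four strokes -->
  <race type="IM">
    <stroke>
      <butterfly/>
      <breast/>
      <back/>
      <freestyle/>
    </stroke>
  </race>
  <!-- A second race reuses the same elements with different contents -->
  <race type="freestyle">
    <stroke>
      <freestyle/>
    </stroke>
  </race>
</swimming>
```

Empty elements are written here in the short form (<butterfly/>), which is equivalent to an opening tag followed immediately by its closing tag.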

A further drawback to HTML's system of headings and text is the lack of control you have over your site's appearance. Although you can write stylesheets to specify how different HTML elements such as <h3> or <p> look, you can't easily attach a specific appearance to recurring data.

For example, you might decide that all instances of a site's name must appear in blue. In HTML, there's no easy way to do this -- you would have to search for the word AcmeCo and attach tags such as <font color="blue">AcmeCo</font> to each instance.

XML lets you create a tag called <company> </company> and use it to enclose every instance of AcmeCo. To change the appearance of AcmeCo, you simply write a stylesheet to control the appearance of whatever appears within <company> </company>. This way, you have to change only one line in the stylesheet.
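As a rough sketch of such a stylesheet, here is one way to write the rule, assuming XSLT 1.0 (the transformation part of XSL) is used to turn the XML into HTML. The choice of a <span> with an inline style is an illustration, not the only way to do it.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Every <company> element is wrapped in a blue span in the output -->
  <xsl:template match="company">
    <span style="color: blue">
      <xsl:apply-templates/>
    </span>
  </xsl:template>
</xsl:stylesheet>
```

Changing the color for every instance of AcmeCo then means editing the single style value inside the template.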

Before you try your hand at XML, you'll need to familiarize yourself with its lingo. The way a browser interprets a markup language is similar to the way a person understands a spoken or written language. Just as a person learns grammar and vocabulary to interpret strings of words, a browser has the ability to understand a file's set of rules and vocabulary. With a markup language, the companies that make the browsers determine what a browser will read and understand.

Fortunately, these companies don't have to invent the grammar and vocabulary their browsers will understand -- the World Wide Web Consortium (W3C) has already done that. The W3C is the closest thing the Web has to a governing body. It decides on the technical protocols that computers connected to the Net must be able to recognize and implement. The W3C's recommendations outline what the grammar and vocabulary that compose a markup language should be.

If a browser's going to recognize an XML document, it must understand four W3C recommendations, which work in concert to render XML pages. If you try to write XML without meeting all of these recommendations, your code won't work.

XML   The grammar for markup languages, XML is the general guideline to follow when writing different types of languages that will organize and present your site content.

XML Linking Language (XLink)   Hyperlinking is the core of what makes the Web work. XLink is a W3C recommendation that outlines hyperlink behavior. In other words, it tells the browser what it should do when it encounters a hyperlink in a document.

XML Pointer Language (XPointer)   If XLink dictates how links behave, XPointer specifies what information those links contain. Today's hyperlinks simply point to a document address; they don't actually contain any information. XPointer supplies a way to add highly specific information about the role specific links play relative to the rest of the content in a document. For example, links in a navigation bar can now specify which page they should point to, in addition to specific addresses. This means you can more easily track hyperlinks and change them across an entire site: instead of changing a specific hyperlink address on several thousand pages, you can change it once, and the other links will all redirect based on the altered information.

It's Greek to Me

Even though the terms are in English, you may not be able to make sense of all the XML jargon. Our glossary will help you separate the Greek from the geek.
data   The two different types of information that describe an entity -- character and markup. Character data explains the content in an entity, and markup data describes the logical structure (i.e., where it fits) of the entity.
Document-Type Definition (DTD)   A file containing the formal definitions that will describe the content structure and attributes within a document. A DTD dictates what names will be used for different tags (also known as elements), how frequently elements can occur, and how assorted elements fit together.
DTDless   A file created without a DTD. Because writing a DTD is often a complex and time-consuming task, XML can also work without a DTD.
element   A building block for a markup language's structural organization and content. For example, <link> would be an element specifying the relationship among different documents.
entity   A widely used term, with several definitions. Entities are usually coded into a document's DTD. They can perform repeated tasks such as setting up nicknames for frequently referenced data. Instead of typing Go to every time, you could create an entity such as <!ENTITY mw "Go to">.
Standalone Document Declaration (SDD)   If your XML document doesn't have a DTD, it should include an SDD, which alerts the application reading it to the document's DTDless state.
Standard Generalized Markup Language (SGML)   The international standard for setting the descriptive rules of electronic documents' structure and content. SGML spawned both XML and HTML. Just as different dialects alter textbook English, these markup languages are considered variants.
XSL   Extensible Stylesheet Language, an XML-based language devoted to specifying a visual style for the items in an XML document.
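
Two of the glossary terms can be seen in a few lines of markup. This is a sketch only: the <note> element and its text are hypothetical, while the mw entity is the one from the glossary entry. The standalone="yes" attribute on the first line is how XML 1.0 spells the standalone document declaration.

```xml
<?xml version="1.0" standalone="yes"?>
<!-- Internal declarations: one entity, mw, set up as a nickname for "Go to" -->
<!DOCTYPE note [
  <!ENTITY mw "Go to">
]>
<!-- Hypothetical root element; the reference &mw; expands to "Go to" when parsed -->
<note>&mw; the swimming page.</note>
```

Here the only declaration lives inside the document itself, so no separate DTD file is needed to read it.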

Extensible Stylesheet Language (XSL)   XML documents are controlled by a stylesheet; the only question is which type of stylesheet to use. The newcomer in the stylesheet market is Extensible Stylesheet Language, an XML-based recommendation drafted specifically for XML documents. The biggest difference between XSL and CSS (Cascading Style Sheets -- for more information, see "Reconcilable Differences," How-to, September 2000) lies in the language used to write the stylesheets: XSL is XML-based, and CSS is not.

XLink, XPointer, and XSL are written using XML syntax, and all three different protocols help to build a typical XML Web page: XSL might determine a page's appearance; XLink and XPointer determine what the links will do on the page. The page's content will be organized and marked up using an XML Document Type Definition (DTD).
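
To give a feel for what link markup written in XML syntax looks like, here is a minimal sketch of an XLink simple link. The <seealso> element name and the file name strokes.xml are hypothetical; the xlink: attributes and the namespace address are the ones defined by XLink itself.

```xml
<!-- Hypothetical element and file names; the xlink: attributes belong to XLink -->
<seealso xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="simple"
         xlink:href="strokes.xml"
         xlink:show="replace"
         xlink:actuate="onRequest">Types of swimming strokes</seealso>
```

In this sketch, xlink:show="replace" asks for the target to replace the current page, and xlink:actuate="onRequest" says the link should wait for the reader to activate it.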

The four protocols are the building blocks -- what makes them all tick is the DTD. If XML is like markup-language grammar, the DTD acts as the markup language's dictionary and style guide. A DTD defines the terms of an XML-based document, the specific details each term has, and the relationships the terms have to one another. The DTD excerpt below identifies common elements in a markup language designed to format Shakespeare's plays.





The elements identified are: speech, to be used when a character is making a speech; speaker, to be used to identify characters with speaking roles; line, to designate each line in a speech; and stagedir, which dictates the directions that accompany the speech. You can see from the example that elements can nest inside each other.
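
As a sketch of what declarations for these four elements could look like, the example below places them inside a small document and follows them with one marked-up speech from Hamlet. The element names come from the paragraph above; the lower-case spelling and the exact content models are assumptions, not a quotation from any published DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE speech [
  <!-- A speech names one or more speakers, then holds lines and stage directions -->
  <!ELEMENT speech   (speaker+, (line | stagedir)+)>
  <!ELEMENT speaker  (#PCDATA)>
  <!-- A line is text that may have stage directions nested inside it -->
  <!ELEMENT line     (#PCDATA | stagedir)*>
  <!ELEMENT stagedir (#PCDATA)>
]>
<speech>
  <speaker>HAMLET</speaker>
  <line><stagedir>Aside</stagedir> A little more than kin, and less than kind.</line>
</speech>
```

The stage direction sits inside the line it belongs to, which is the kind of nesting the declarations allow.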

Although one of the biggest advantages to developing a site in XML is having the ability to set up your own logical data structure via a custom-built DTD, another advantage is being able to use a different, standard DTD. These DTDs can be specific to industries (imagine a group of markup tags devised especially for accountants) or other, already established means of organizing content. For example, Jon Bosak has written DTDs for Shakespeare plays, thus providing a grammar for denoting characters and their lines.

Find Out More About XML

Want to take your XML knowledge to the next level? Start with these informative Web resources.
Anyone interested in staying up-to-date on the latest XML developments should save this URL. Sporting everything from product news to beginning tutorials, this site also boasts columns and how-tos from XML gurus.
XML Software   If you're itching to try XML on your own, it helps to have browsers and tools that can render and convert it. Check out the complete collection of XML-centric applications.
The XML Cover Pages   For XML news -- from tracking the progress of the latest XML-related W3C recommendations to newly launched XML sites -- this site has it all.

The DTD is crucial for providing the "rules" in an XML document or on an XML site. Since we're still in an HTML world, most Web documents have HTML DTDs -- if they have DTDs at all -- and a general syntax to organize markup tags. How, then, will you move the Web pages you have from one type of markup language to another?

Changing a Web site's markup language from HTML to the more advanced XML is an evolutionary step. Extensible HTML (XHTML) mixes HTML's limited vocabulary with XML's data-organizing capabilities. To make the transition, you must follow a number of simple format rules:

All markup must be in lower-case tags:

<h1> My page title </h1>

All attributes must be in quotes:

<body bgcolor="#FFFFFF">

All elements must have opening and closing tags:

<li>list item </li>

All documents must have a DTD. XHTML authors can choose from three different XHTML DTDs, all of which are hosted on the W3C's Web site.

To attach a DTD to your document, you need a statement at its beginning called a doctype declaration, which says what sort of DTD the document uses and where the DTD lives.

For example, to include a strict doctype declaration (which assumes you're using strict XHTML) in your document, put these lines at the very top:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" " strict.dtd">

The transitional doctype declaration is the most flexible: you use it if you're trying to ensure that people using non-CSS enabled browsers can see your site. If you're still using tables to lay out your Web site, you'll want to use this one:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">

The frameset declaration is what you use if you're writing a document with frames. Its syntax is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "">
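
The three syntax rules listed earlier (lower-case tags, quoted attributes, and closed elements) can be seen working together in one short fragment. This is a sketch built from the article's own sample lines; the doctype declaration is left out to keep the focus on those rules, and because bgcolor is a presentational attribute, the fragment fits the transitional flavor of XHTML rather than the strict one.

```xml
<!-- Loosely written HTML of the kind the rules exclude:
     <BODY BGCOLOR=#FFFFFF><H1>My page title<UL><LI>list item</UL> -->
<body bgcolor="#FFFFFF">
  <h1>My page title</h1>
  <ul>
    <li>list item</li>
  </ul>
</body>
```

Every tag is lower-case, the attribute value is quoted, and each element that opens is explicitly closed.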

Once you've converted your site to XHTML, you can use an XML DTD to refine the way your site's content is organized through defined markup tags. Novice DTD writers can also check out to find a well-sorted directory of DTDs designed for everything from advertising to ontology to travel. You can also check out a list of Web sites that have integrated XML into their content at XMLTree.

Although it's tempting to begin plotting your site's conversion from HTML to XML, it's easy to get bogged down in practicalities. Developing a specific markup language is only part of the process; you must also figure out how to map your current HTML content to a more structured XML markup. In addition, there will be a learning curve: many Web developers are quite familiar with HTML because they've been using it frequently and for a long time. Acquiring the same familiarity with XML -- and the tools you can use to develop Web sites in it -- will require practice. Within a year, however, building XML-based Web sites based on different DTDs or XML-based markup languages should be simple. And in the end, getting your content out to all the people who want to see it may be that much easier.

LISA SCHMEISER is's senior editor.
