In my last column I referred you to XHTML as a way to transition from an HTML-based website to an XML-based website. Since XHTML is the future of HTML, I thought it was worth a closer look.
XHTML Defined
XHTML is an XML encoding of HTML 4. XHTML is defined by the World Wide Web Consortium (W3C for short), which also defines the various XML and HTML standards. XHTML Basic is a subset of XHTML geared specifically for small devices that can’t support the full functionality of XHTML — more on that later.
Why was XHTML developed? Although both XML and HTML are derivatives of SGML (Standard Generalized Markup Language), there are important differences between the two markup languages. HTML is quite forgiving when it comes to tag placement, missing tags, and attribute values. A web browser, for example, will correctly interpret an HTML page even if you leave out the <HTML> and </HTML> tags. Nor does it require a </P> tag to mark the end of a paragraph. XML, however, has strict rules about the format and use of tags. These rules make it easier to parse XML documents. More importantly, different XML parsers will interpret a given XML document in the same way. HTML parsers are not as consistent because the rules aren’t as strict, which leads to differences in the way web browsers display pages.
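To make the contrast concrete, here is a minimal sketch using Python's standard library as a stand-in for a browser and an XML parser. The names SLOPPY, TagCollector, html_start_tags and is_well_formed_xml are made up for this illustration. Python's html.parser is not a browser engine, but it is similarly tolerant of missing tags.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Sloppy HTML: no <HTML> wrapper, and neither <P> is ever closed.
SLOPPY = "<P>Hello world!<P>Hello again!"


class TagCollector(HTMLParser):
    """Records every start tag reported by the lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        self.start_tags.append(tag)


def html_start_tags(markup):
    """Parse markup as HTML and return the start tags that were seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(html_start_tags(SLOPPY))       # the HTML parser copes with the input
    print(is_well_formed_xml(SLOPPY))    # the XML parser rejects it
```

The HTML parser simply reports what it finds, while the XML parser refuses the whole document as soon as the tags fail to balance.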
From HTML to XHTML
XHTML is the bridge between HTML and XML. It imposes XML’s strict syntax on HTML and defines XML document type definitions (DTDs) covering HTML 4.01. Take, for example, this simple HTML document:
<HTML>
<HEAD><TITLE>Hello World!</TITLE></HEAD>
<BODY>
Hello world!
<HR>
Hello again!
</BODY>
</HTML>
When converted to XHTML it looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>Hello World!</title></head>
<body>
<p>Hello world!</p>
<hr/>
<p>Hello again!</p>
</body>
</html>
Not too different, is it? And to convert your HTML to XHTML there’s a tool called HTML Tidy that does the job very well.
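One practical payoff of the conversion is that the XHTML version can be read by any ordinary XML parser. The sketch below feeds the converted document to Python's xml.etree.ElementTree and pulls out the paragraph text. XHTML_DOC and paragraph_texts are names invented for this example. Note that this only checks that the document is well-formed; it does not validate it against the DTD.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <html> element of the XHTML document.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# The converted document from the article, passed as bytes so the
# encoding named in the XML declaration is honoured by the parser.
XHTML_DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>Hello World!</title></head>
<body>
<p>Hello world!</p>
<hr/>
<p>Hello again!</p>
</body>
</html>
"""


def paragraph_texts(document):
    """Parse an XHTML document and return the text of each <p> element."""
    root = ET.fromstring(document)  # raises ET.ParseError if not well-formed
    # Element names are qualified with the XHTML namespace after parsing.
    return [p.text for p in root.iter("{%s}p" % XHTML_NS)]


if __name__ == "__main__":
    print(paragraph_texts(XHTML_DOC))
```

The original HTML version would fail at the ET.fromstring call, because its <HR> tag is never closed.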
But of course, today’s browsers don’t understand XHTML. For example, XML constructs like the shorthand notation for empty elements — where instead of using a start tag and an end tag (like <abc> and </abc>) you use a single tag ending in a slash (like <abc/>) — will be ignored or misinterpreted by web browsers.
We can take advantage of a web browser’s forgiving nature, however, and adapt our XHTML so that it can be read correctly by a web browser while still maintaining correct XML syntax. Appendix C of the XHTML specification shows you how to do this. For example, place a space between the tag name and the trailing slash when specifying an empty element. In other words, use <hr /> instead of <hr/>.
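The empty-element guideline is easy to apply mechanically. The following is a rough sketch, not a complete Appendix C converter: add_space_before_slash is a hypothetical helper that uses a naive regular expression, so it would also rewrite any "/>" that happens to appear inside an attribute value, comment, or CDATA section.

```python
import re
import xml.etree.ElementTree as ET


def add_space_before_slash(markup):
    """Rewrite empty-element tags so exactly one space precedes '/>'."""
    # Any run of whitespace (or none) before "/>" becomes a single space.
    return re.sub(r"\s*/>", " />", markup)


if __name__ == "__main__":
    fragment = "<body><p>Hello world!</p><hr/><p>Hello again!</p></body>"
    adapted = add_space_before_slash(fragment)
    print(adapted)
    # The adapted markup is still well-formed XML.
    ET.fromstring(adapted)
```

Because whitespace before the closing slash is legal in XML, the adapted markup stays well-formed while being friendlier to HTML browsers.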
XHTML Basic
From a small device perspective, however, XHTML is too large to support, because HTML 4 is itself very large. In order to encourage the adoption of XHTML for all devices, the W3C is currently defining a standard called XHTML Basic. It defines a standardized minimal subset of XHTML aimed specifically at “small information appliances”. It’s really a pre-emptive strike by the W3C to avoid the fragmentation of XHTML by third parties, which is what happened with HTML.
Does WML Disappear?
There will inevitably be a push to replace languages like HDML and WML with XHTML Basic. This only makes sense since the industry is moving towards adopting XML-based formats for almost every kind of data interchange. In fact, since WML is itself an XML language, the transition to XHTML doesn’t seem that hard at first glance. However, you have to remember that WML and HDML define actions as well as content. These actions currently have no equivalent in XHTML. So, in the short term at least, WML and HDML aren’t going to disappear. It will be interesting to see which one wins out in the end, though. Plan on supporting all three markup languages at some point!