An Introduction To VoiceXML

By Miro Stoichev


VoiceXML is the name of a technology standard developed and managed by the VoiceXML Forum. It builds upon the work of earlier technologies such as VoXML from Motorola and SpeechML from IBM to create a standardized way to interact with services through a voice interface. Not surprisingly, this technology is XML-based. The VoiceXML Forum was founded by AT&T, Lucent Technologies, Motorola, and IBM with the “aim to drive the market for voice- and phone-enabled Internet access by promoting a standard specification for VXML, a computer language used to create Web content and services that can be accessed by phone.” The 1.0 release of the VoiceXML Specification was presented to the W3C for approval in March 2000 and can be downloaded from the VoiceXML Forum website.

Why Voice?

Perhaps the first question that may arise is: “Why do we need a markup language for voice commands?” The answer is becoming increasingly obvious as some members of the technology community express their displeasure with textual wireless interfaces. Wireless communication devices have small screens, limited input capabilities, and limited processing power. They have obviously been huge successes as voice communication conduits; however, it remains to be seen how the public will accept them as data delivery vehicles.

One alternative to the textual interface offered by technologies such as WAP is what was originally known as an IVR, or Interactive Voice Response, system. Historically, these systems have been very proprietary and therefore unsuitable for allowing access to Web-based content. VoiceXML allows you to define a “tree” of voice dialogs that steps the user through a selection process. The user interacts with these voice dialogs through the oldest interface known to mankind: the voice! Powerful speech recognition software resides on the server to convert the user’s spoken selection (e.g. “Yes” or “No”) into a textual selection. This process is akin to selecting a hyperlink on a traditional Web page. Dialog selections result in the playback of audio responses, which are either prerecorded files or generated dynamically using some sort of server-side text-to-speech conversion.

From a business viewpoint, voice applications open up a host of new revenue opportunities. Perhaps the most obvious revenue opportunity comes from the increased number of minutes we will all be spending on our wireless phones. In addition, advertising will become as commonplace through these services as it currently is on traditional media (Web, TV, radio, etc.). As voice services are added to your traditional carrier plan, there will clearly be a market for pay-as-you-go premium services (information lookups, email, contact databases, etc.). It’s not hard to imagine most consumers opting to listen to a 15-second ad in exchange for free access to these premium services! Because VoiceXML is XML-based, it is yet another technology driving the move towards content distribution and management in XML. Within two years, it is very likely that content providers will offer both WAP- and Voice-accessible sites for their wireless customers. Clearly, by this point, a manageable architecture using XML will be required.

Developing for VoiceXML

We’ll be providing a complete tutorial for VoiceXML in the near future, but we’ll begin to take a look at the technology in this section. As mentioned earlier, VoiceXML is an XML application that defines a tree-like structure that the user can traverse using voice commands. An integral component of every VoiceXML application is the text-to-speech and speech-to-text processing engine that runs on the server. These products are available from a variety of vendors including IBM and Motorola. Readers familiar with WML will find VoiceXML vaguely familiar as well, because both are XML-based markup languages used to define a group of elements that enable a user to traverse information. All VoiceXML “documents” begin and end with the <vxml> tag. Before diving in, we’d be remiss if we didn’t at least kick things off with a simple “Hello World!” example.

<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

As you can see, this is a very simple example that uses the <block> element to present some text to the user. VoiceXML defines two types of dialogs that the application uses to interface with the user: forms and menus. A form is used either to present information to the user or to retrieve information from the user. A menu is a specialized form that forces the user to choose a specific option and then branches based on the option that was chosen. A form generally includes a directive that tells the application where to jump based on the user’s input. A menu uses multiple <choice> elements to define where to transition to based on the user’s selection. The following example prompts the user to select from a set of options and calls a different VoiceXML document based on the selected option.

<?xml version="1.0"?>
<vxml version="1.0">
  <menu id="Simple_Example">
    <prompt>
      <audio>Welcome to my voice mail. Say Mail to leave me a voice mail and Operator to return to our operator.</audio>
    </prompt>
    <choice next="voicemail.vxml"> Mail </choice>
    <choice next="operator.vxml"> Operator </choice>
    <default><reprompt/></default>
  </menu>
</vxml>

As you can see, the choice options within the menu call additional VoiceXML “routines” that define further tasks. The <default> element handles every response that is neither a “Mail” nor an “Operator” command. In this case, it includes a <reprompt/> command that simply orders the server to replay the items in the prompt list.
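
For contrast with the menu above, the sketch below shows how a form might retrieve a yes/no answer from the caller and then branch on it. It is only an illustration: the form id, the field name wants_news, and the documents headlines.vxml and goodbye.vxml are hypothetical, and it assumes a platform that supports the built-in boolean field type described in the VoiceXML 1.0 specification.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="news_offer">
    <!-- A field collects one piece of information from the caller -->
    <field name="wants_news" type="boolean">
      <prompt>Would you like to hear today's headlines?</prompt>
      <!-- Runs once the recognizer has turned the spoken answer into a value -->
      <filled>
        <if cond="wants_news">
          <!-- Hypothetical target documents, for illustration only -->
          <goto next="headlines.vxml"/>
        <else/>
          <goto next="goodbye.vxml"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Where a menu branches directly on the spoken choice, a form like this one first stores the answer in a variable, which leaves room to collect several fields before deciding where to go next.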

By no means is this a comprehensive tutorial, but it should give you a feel for the way in which voice commands are processed and defined on the server. Obviously, this sort of application would have to factor in scalability and server loads given the processing power required to continually perform speech-to-text conversion. In addition, traditional server development tasks (database/messaging system access, spatial queries, etc.) would need to be developed to interface with prompt choices.
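
One way a dialog might hand the caller's input to that server-side logic is the <submit> element, which sends named variables to a URL and expects a new VoiceXML document in return. The following is a minimal sketch under stated assumptions: the URL, the field name ticker, and the grammar file companies.gram are all hypothetical, and the grammar format a platform accepts depends on the speech recognizer in use.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="stock_quote">
    <field name="ticker">
      <prompt>Which company would you like a quote for?</prompt>
      <!-- Hypothetical grammar file listing the company names the recognizer should accept -->
      <grammar src="companies.gram"/>
    </field>
    <!-- After the field is filled, send its value to a server-side script (hypothetical URL) -->
    <block>
      <submit next="http://www.example.com/quote" namelist="ticker"/>
    </block>
  </form>
</vxml>
```

The script behind that URL would perform the database lookup and generate the next VoiceXML document, much as a CGI script generates an HTML page in response to a Web form.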

Potential Applications

The potential applications of this technology abound. Currently, voice “portals” are springing up to offer voice access to stock quotes, movie and restaurant listings, and daily news. Just as traditional Web portals redefined themselves over time into personalized services, these voice portals will eventually offer access to your email (to both check and send emails using your voice!). Ironically, voice technologies are over a century ahead of the Web in one area: instant messaging (isn’t that what Alexander Graham Bell’s original phone call basically amounted to?!?).

These voice-related technologies will also be among the first location-based services to appear on the market because of the mobile nature of the end user. For instance, currently the TellMe portal can automatically retrieve movie listings for the nearest theater based on the number you are calling from. For the majority of location-based applications, this type of service is accurate enough. In the future, one could imagine integration with the FCC-mandated E911 positioning or even a GPS in the handset for driving directions that actually talk you through the twists and turns required to reach a destination.
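
As a rough illustration of how a dialog could pass the caller's number to a location lookup, the sketch below reads the calling number from a session variable and submits it to a server. This is not how TellMe is known to implement its service. It assumes the platform exposes the number as session.telephone.ani, the name used in the VoiceXML 1.0 specification (other versions and platforms may name it differently), and the URL and the variable name caller_number are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="nearby_movies">
    <block>
      <!-- Copy the caller's number (ANI) into a document variable;
           assumes the platform provides session.telephone.ani -->
      <var name="caller_number" expr="session.telephone.ani"/>
      <!-- Hypothetical server-side script that maps the number to nearby theaters -->
      <submit next="http://www.example.com/movies" namelist="caller_number"/>
    </block>
  </form>
</vxml>
```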

The Business Case


Separate from the discussion of VoiceXML is a look at the benefits of voice processing technologies in general. Despite the advent of technologies such as WAP, the fact remains that accessing textual content over a small phone display is difficult and, in some applications, rather unnatural. Add any amount of data entry over the phone and it quickly becomes an impractical interface. Voice technologies, on the other hand, take advantage of the very interface that phones were designed to serve and will undoubtedly be accepted more readily by the general public. VoiceXML, specifically, is a well-structured, uniform way to build logic trees that customers can use to access the information of interest to them. Look for tools and services based on VoiceXML to become increasingly popular in the near future.


Perhaps the biggest disadvantage of voice-based technologies is the rigid structure that they impose on the end user. While a textual interface (e.g. WML) can support popular tools such as search engines and online browsing of catalogs or information, voice technologies are much better at delivering a specific, pinpointed bit of information to an end user (e.g. a stock quote, a movie time, or a restaurant location). In addition, while convenient, a voice portal such as TellMe can be maddeningly slow when you are forced to drill through several layers of options before finding exactly what you want. One interesting combination of the textual interface and the voice interface is the tool known as a “voice browser” (for a demo of a popular voice browser, visit Conversa). A voice browser allows the user to “speak” links to quickly traverse textual content, which may be a great compromise, particularly in automobile or other hands-free applications.


Before making a decision on voice access for your wireless application, take the time to visit a voice portal. While much of the wireless world is spending time trying to figure out how to deliver advertising as well as content to mobile devices, these voice portals are successfully doing both of these things now, with location-based services thrown in for good measure! Having said that, it is becoming increasingly clear that, unlike the wired world, wireless application delivery will require support for a wide variety of devices and access technologies, including voice access to your applications via VoiceXML.