Parsing XML With PHP


By Miro Stoichev

We have presented a simple framework (named XMLCast) for distributing content to a variety of devices using XML. This application was built using Microsoft’s Active Server Pages (ASP) technology, but we realize that many of you aren’t using ASP (we aren’t either). This article will present the concept of XML parsing using the PHP scripting language. In the coming weeks, we will follow this example up with an expansion of XMLCast using other tools such as XSLT and Cocoon.

Recently, The Wireless Developer Network began offering our daily news in a variety of formats for people who wanted their news delivered in ways other than standard HTML. Among the formats we offer is Rich Site Summary (RSS), an XML format that splits up items (news headlines, in this case) into easily extractable elements. This allows other sites to grab our latest news headlines, format them as they wish, and list them on a page on their site, all with the convenience of XML data exchange.

More about Rich Site Summary

RSS version 0.91 was developed by Netscape for their “My Netscape Network” and it allows a site to create an XML file that contains basic information about the site, in addition to “items” which can have “title”, “link” and “description” nodes. That’s great, you say, but now that we’ve got the RSS XML document, how do we extract the information and serve it up as HTML? Well, each language has its own way to deal with XML, and for this example, we’re using PHP and its included XML parser. PHP uses James Clark’s expat library, which you already have if you are using Apache 1.3.9 or later. To parse XML with PHP, you must configure PHP with the --with-xml argument prior to make and make install.
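Before going further, it can help to confirm that the PHP build you are running actually includes the XML parser functions. The following is a minimal sketch; the helper name hasXmlParser is our own invention for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether the XML parser
// extension and its xml_parser_create() function are available.
function hasXmlParser(): bool
{
    return extension_loaded('xml') && function_exists('xml_parser_create');
}

echo hasXmlParser()
    ? "XML parser support is available\n"
    : "XML parser support is missing; PHP needs to be built with XML support\n";
```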

We’ve written a simple PHP script that parses the RSS file, extracts the pertinent information, formats it, and serves it up as regular HTML. Not only does it give an example of how to parse an RSS XML file with PHP, this script can also be added to any PHP file, allowing for automatically updated news headlines straight from our site.

The first thing we do is create a class to hold our headlines:

    class xItem {
        var $xTitle;
        var $xLink;
        var $xDescription;
    }

Then, we define a few global variables for the general site information, and an array to hold the headline objects.

    $sTitle = "";
    $sLink = "";
    $sDescription = "";
    $arItems = array();
    $itemCount = 0;

The meat of the XML parsing is in the next three functions, startElement, endElement, and characterData. We’ve used a nice trick by David Medinets from his book PHP3 – Programming Browser-Based Applications for extracting the XML data in PHP. With PHP’s implementation of XML, there’s no easy way to get around using global variables, but David’s way is one of the most straightforward PHP-XML implementations we’ve found. Here are the first two functions:

    // Called at each opening tag: append the tag name to the current path.
    function startElement($parser, $name, $attrs) {
        global $curTag;
        $curTag .= "^$name";
    }

    // Called at each closing tag: strip the last tag name from the path.
    function endElement($parser, $name) {
        global $curTag;
        $caret_pos = strrpos($curTag, '^');
        $curTag = substr($curTag, 0, $caret_pos);
    }

To parse XML in PHP, you define functions to handle:

a) when the parser encounters the start element of a tag
b) when the parser encounters the end element of a tag
c) when the parser encounters the data within the start and end tags
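To see the start and end handlers working together, the sketch below records the caret-separated path each time a start tag is found. It uses closures rather than global variables purely to keep the example self-contained; the function name tracePaths is hypothetical and is not taken from the original script.

```php
<?php
// Hypothetical helper: returns the caret-separated path recorded at each
// start tag while parsing $xml.
function tracePaths(string $xml): array
{
    $path = '';
    $seen = [];

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        // Start tag: append the tag name to the path and record it.
        function ($parser, $name, $attrs) use (&$path, &$seen) {
            $path .= '^' . $name;
            $seen[] = $path;
        },
        // End tag: drop the last tag name from the path.
        function ($parser, $name) use (&$path) {
            $path = substr($path, 0, strrpos($path, '^'));
        }
    );
    xml_parse($parser, $xml, true);

    return $seen;
}

print_r(tracePaths('<rss><channel><item></item></channel></rss>'));
// Recorded paths: ^RSS, ^RSS^CHANNEL, ^RSS^CHANNEL^ITEM
```

The names come back in capitals because PHP’s parser has case folding switched on by default.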

The way we handle these functions is by setting a global variable ($curTag) to a string containing all the parent tags separated by a caret (^). For example, an XML structure that looks like:

    <rss>
      <channel>
        <item>
        </item>
      </channel>
    </rss>

would translate to a $curTag of:

    ^RSS^CHANNEL^ITEM

when the parser has found the <item> tag. (PHP’s parser upper-cases tag names by default, which is why the path is in capitals.) All we have to do is check for when the parser has found the correct $curTag, and extract the data accordingly. That’s all done in the characterData function. Here it is:

    function characterData($parser, $data) {
        global $curTag;

        // get the Channel information first
        global $sTitle, $sLink, $sDescription;
        $titleKey = "^RSS^CHANNEL^TITLE";
        $linkKey = "^RSS^CHANNEL^LINK";
        $descKey = "^RSS^CHANNEL^DESCRIPTION";
        if ($curTag == $titleKey) {
            $sTitle = $data;
        } elseif ($curTag == $linkKey) {
            $sLink = $data;
        } elseif ($curTag == $descKey) {
            $sDescription = $data;
        }

        // now get the items
        global $arItems, $itemCount;
        $itemTitleKey = "^RSS^CHANNEL^ITEM^TITLE";
        $itemLinkKey = "^RSS^CHANNEL^ITEM^LINK";
        $itemDescKey = "^RSS^CHANNEL^ITEM^DESCRIPTION";
        if ($curTag == $itemTitleKey) {
            // make new xItem
            $arItems[$itemCount] = new xItem();
            // set new item object's properties
            $arItems[$itemCount]->xTitle = $data;
        } elseif ($curTag == $itemLinkKey) {
            $arItems[$itemCount]->xLink = $data;
        } elseif ($curTag == $itemDescKey) {
            $arItems[$itemCount]->xDescription = $data;
            // increment item counter
            $itemCount++;
        }
    }

The characterData function checks if the $curTag is something we want to extract, and if it is, assigns the data to our variables. The first chunk extracts the general information about the site; the second checks whether we’ve come across an <ITEM>. If we have, it creates a new xItem, inserts it into our $arItems array, and sets the properties to the appropriate data from the RSS file.
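One thing to keep in mind is that the parser may deliver the text of a single element in more than one characterData call (for example, around an entity such as &amp;amp;), and the function above assumes each element’s text arrives in one piece. The following is a sketch of a variant that appends each chunk instead of assigning it, and stores an item when its closing tag is reached. It is not the original script: the function name parseRssItems, the use of closures, and the array-based item layout are all assumptions made for illustration.

```php
<?php
// Hypothetical variant of the handlers: text is appended (.=) because the
// parser may split one element's text across several callback calls.
function parseRssItems(string $xml): array
{
    $path = '';
    $items = [];
    $current = null;

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$path, &$current) {
            $path .= '^' . $name;
            if ($path === '^RSS^CHANNEL^ITEM') {
                // A new <item> begins: start with empty fields.
                $current = ['title' => '', 'link' => '', 'description' => ''];
            }
        },
        function ($parser, $name) use (&$path, &$current, &$items) {
            if ($path === '^RSS^CHANNEL^ITEM') {
                // The <item> is complete: trim and store it.
                $items[] = array_map('trim', $current);
                $current = null;
            }
            $path = substr($path, 0, strrpos($path, '^'));
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$path, &$current) {
            $prefix = '^RSS^CHANNEL^ITEM^';
            if ($current !== null && strpos($path, $prefix) === 0) {
                // Map e.g. ^RSS^CHANNEL^ITEM^TITLE to the 'title' field.
                $field = strtolower(substr($path, strlen($prefix)));
                if (isset($current[$field])) {
                    $current[$field] .= $data;
                }
            }
        }
    );
    xml_parse($parser, $xml, true);

    return $items;
}

$feed = '<rss version="0.91"><channel><title>Site</title>'
      . '<item><title>Tom &amp; Jerry</title>'
      . '<link>http://example.com/1</link></item>'
      . '</channel></rss>';
print_r(parseRssItems($feed));
```

Because the item is stored at its closing tag, an item without a description is still recorded, with an empty description field.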

PHP’s standard

Now that the functions are defined, we use PHP’s standard way of assigning our functions to the XML parser:

    // main loop
    $xml_parser = xml_parser_create();
    xml_set_element_handler($xml_parser, "startElement", "endElement");
    xml_set_character_data_handler($xml_parser, "characterData");

    if (!($fp = fopen($uFile, "r"))) {
        die("could not open RSS for input");
    }

    // feed the file to the parser in 4 KB chunks
    while ($data = fread($fp, 4096)) {
        if (!xml_parse($xml_parser, $data, feof($fp))) {
            die(sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($xml_parser)),
                xml_get_current_line_number($xml_parser)));
        }
    }
    fclose($fp);
    xml_parser_free($xml_parser);

Everything in the above code that starts with “xml_” is a standard PHP XML function. We tell PHP’s XML parser we want our functions to execute when the parser comes across a start tag, end tag, or character data, and then we load the RSS file ($uFile, set to our RSS document), and start up the parser (xml_parse).
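To see what the error branch reports without needing a file on disk, the sketch below parses a string in a single call and returns the same style of message. The helper name xmlError is hypothetical, and the exact wording of the message depends on the underlying parser library.

```php
<?php
// Hypothetical helper: returns null when $xml is well-formed, or an error
// message in the "XML error: ... at line N" format otherwise.
function xmlError(string $xml): ?string
{
    $parser = xml_parser_create();
    // The third argument (true) marks this as the final chunk of data.
    if (xml_parse($parser, $xml, true)) {
        return null;
    }
    return sprintf(
        'XML error: %s at line %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

var_dump(xmlError('<rss><channel></channel></rss>')); // well-formed input
var_dump(xmlError("<rss>\n<channel></rss>"));         // mismatched tags
```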

Now that we have the data in nice little objects and variables, formatting it and serving it up is simple:

    <html>
    <head>
    <title></title>
    <meta name="description" content="">
    </head>
    <body bgcolor="#FFFFFF">
    <font face="" size="">
    <a href=""></a>
    </font>
    <br>
    <br>
    </body>
    </html>

We’ve added a few user-defined variables to set the font, font size, and whether or not you want the descriptions along with the headlines (see the source code for details), but basically the above code loops through our array of items, echoing them out in a basic format.
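As a rough illustration of that output loop, here is a minimal sketch that turns a list of items into linked headlines. It is not the original template: the function name renderHeadlines, the $showDescriptions flag, and the array-based item layout are assumptions (with the xItem class, the fields would be read as $item->xTitle and so on). Values are passed through htmlspecialchars so that characters such as & in a headline do not break the HTML.

```php
<?php
// Hypothetical renderer: $items is assumed to be a list of arrays with
// 'title', 'link' and 'description' keys.
function renderHeadlines(array $items, bool $showDescriptions = true): string
{
    $html = '';
    foreach ($items as $item) {
        // One linked headline per item, with values escaped for HTML.
        $html .= sprintf(
            "<a href=\"%s\">%s</a><br>\n",
            htmlspecialchars($item['link'], ENT_QUOTES),
            htmlspecialchars($item['title'], ENT_QUOTES)
        );
        if ($showDescriptions && $item['description'] !== '') {
            $html .= htmlspecialchars($item['description'], ENT_QUOTES) . "<br>\n";
        }
    }
    return $html;
}

echo renderHeadlines([
    ['title' => 'Tom & Jerry', 'link' => 'http://example.com/1', 'description' => ''],
]);
// prints: <a href="http://example.com/1">Tom &amp; Jerry</a><br>
```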

When it comes to exchanging data, XML is hard to beat. Defining an XML format that can be used by many people (like RSS) is just one of the benefits of using this sophisticated, yet elegant, technology. Parsing XML in PHP may not be quite so straightforward at first, but once you get a handle on it, the possibilities of exchanging data (especially over something like the Internet) are endless.
