XML encoding (utf-8, ascii)

July 31st, 2011

XML is a markup language similar to HTML. It was designed to transport data. Once data has been encoded as XML, it can be read easily by many different systems. As a result, it is widely used in web services to transfer data.

Recently, I was working on a web service which required us to parse the data from an XML feed and store it in the database. Normally this is a simple task which can be achieved by using PHP's SimpleXML extension to parse the data. However, if the document has not been encoded properly, SimpleXML will fail with an XML parse error.
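To see what such a failure looks like, here is a minimal sketch. The sample document and its contents are made up for illustration: it is labeled UTF-8 but contains the single byte \xE9, which is not valid UTF-8 on its own. The exact error text depends on the libxml version installed.

```php
<?php
// Hypothetical sample feed: declared as UTF-8, but \xE9 is a lone Latin-1 byte.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<jobs><title>Caf\xE9 manager</title></jobs>";

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// simplexml_load_string() returns false when the document cannot be parsed.
$doc = simplexml_load_string($xml);

if ($doc === false) {
    // Print each parser error with the line it was reported on.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), ' (line ', $error->line, ")\n";
    }
    libxml_clear_errors();
}
```

Collecting the errors this way makes it easier to log which feed failed and why, rather than only seeing a warning in the output.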

Whenever an XML document is encoded, the encoding used should be declared in the document.
If the document was encoded with Unicode, for example UTF-8, the following would be the first line of the document:
<?xml version="1.0" encoding="UTF-8"?>

In my case, the XML document was labeled as UTF-8; however, it was really an ASCII document that contained some stray non-ASCII characters. This created a major problem for the parser. The quick solution is to strip the non-ASCII characters from the document.
This can be achieved with the following PHP code:

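// Read the raw XML, then remove every byte that is not a tab, newline or printable ASCII character.
$jobs = file_get_contents('/home/mydir/doc.xml');
$jobs = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $jobs);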
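
As a slightly more cautious variant, the stripping can be applied only when the document is not actually valid UTF-8. The sketch below is an illustration, not part of the original fix: the helper name strip_non_ascii and the sample feed are hypothetical. It relies on preg_match() with the /u modifier, which returns false when the subject string is not valid UTF-8.

```php
<?php
// Hypothetical helper: keep tab, LF, CR and printable ASCII (0x20-0x7E), drop the rest.
function strip_non_ascii(string $xml): string
{
    return preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $xml);
}

// Hypothetical sample feed: declared as UTF-8, but \xE9 is a lone Latin-1 byte.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<jobs><title>Caf\xE9 manager</title></jobs>";

// An empty pattern with /u matches any valid UTF-8 string and fails on invalid input.
$isValidUtf8 = preg_match('//u', $raw) === 1;

// Only strip when the declared encoding does not hold, so valid UTF-8 text is preserved.
$clean = $isValidUtf8 ? $raw : strip_non_ascii($raw);

$doc = simplexml_load_string($clean);
echo (string) $doc->title, "\n";
```

With the sample above the title comes out as "Caf manager": the offending byte is dropped, so this remains a lossy workaround, just as in the quick fix.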

Ryan Wright

Ryan is a PHP/MySQL Developer. As a high school intern, he worked on applications for NASA's bird migration project at City College of New York, where he learned the more intricate details of software development. After studying Computer Engineering at Polytechnic University, Ryan has worked on numerous web applications, ranging from simple sites to more advanced e-commerce solutions and social networking sites.

One Response to “XML encoding (utf-8, ascii)”

  1. I had been looking for this info for some time. After 6 hours of continuous Googling, I finally found it on your web site. I wonder why Google does not rank this type of informative web site closer to the top. Normally the top sites are full of garbage.
