XML encoding (UTF-8, ASCII)
XML is a markup language similar to HTML. It was designed to transport data. Once data has been encoded as XML, it can be easily read by many different systems. As a result, it is widely used in web services to transfer data.
Recently, I was working on a web service that required us to parse the data from an XML feed and store it in a database. Normally this is a simple task that can be handled with PHP's SimpleXML extension. However, if the document has not been encoded properly, SimpleXML will generate an XML parsing error.
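As a minimal sketch of how such a failure surfaces, the snippet below feeds SimpleXML a string that claims to be UTF-8 but contains a lone 0xE9 byte (a Latin-1 "é"), and collects the libxml errors instead of letting them turn into PHP warnings. The sample feed and the variable names are made up for illustration; they are not taken from the real service.

```php
<?php
// Hypothetical feed: declared as UTF-8, but "\xE9" is a lone Latin-1 byte,
// which is not a valid UTF-8 sequence.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<jobs><job>Caf\xE9 manager</job></jobs>";

// Collect libxml errors ourselves instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = simplexml_load_string($xml);
$errors = libxml_get_errors();
libxml_clear_errors();

if ($doc === false) {
    foreach ($errors as $error) {
        // Each entry is a LibXMLError object with line, column and message.
        echo 'Line ', $error->line, ': ', trim($error->message), "\n";
    }
}
```

Reading the collected errors this way makes it easier to tell an encoding problem apart from other well-formedness problems in the feed.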
Whenever an XML document is encoded, the encoding used should be declared in the document itself.
If the document was encoded with a Unicode encoding, for example UTF-8, the following would be the first line of the document:
<?xml version="1.0" encoding="UTF-8"?>
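One way to check whether a document lives up to its declaration is sketched below. The helper names declaredEncoding and isValidUtf8 are hypothetical, and the sketch assumes the declaration sits at the very start of the string (no byte order mark in front of it). The validity check relies on the PCRE "u" modifier, which makes preg_match fail on input that is not valid UTF-8.

```php
<?php
// Return the encoding named in the XML declaration, or null if none is given.
function declaredEncoding(string $xml): ?string
{
    if (preg_match('/^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/', $xml, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// True if the bytes form valid UTF-8; with the /u modifier PCRE
// refuses subjects that are not valid UTF-8 and preg_match returns false.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Sample documents (made up): same declaration, different bytes for "é".
$good = '<?xml version="1.0" encoding="UTF-8"?><job>Caf' . "\xC3\xA9" . '</job>';
$bad  = '<?xml version="1.0" encoding="UTF-8"?><job>Caf' . "\xE9" . '</job>';

var_dump(declaredEncoding($good)); // string(5) "UTF-8"
var_dump(isValidUtf8($good));      // bool(true)
var_dump(isValidUtf8($bad));       // bool(false)
```

A document whose declared encoding is UTF-8 but which fails the validity check is mislabelled in exactly the way that trips up the parser.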
In my case, the XML document was labelled as UTF-8, but it was really an ASCII document that contained some stray non-ASCII characters. This created a major problem for the parser. The quick solution is to strip the non-ASCII characters from the document, at the cost of losing those characters.
This can be achieved with the following PHP code:
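// Read the raw XML feed into a string.
$jobs = file_get_contents('/home/mydir/doc.xml');
// Strip every byte outside printable ASCII, keeping tabs and line breaks.
$jobs = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $jobs);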
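To show the whole round trip, here is a small sketch that strips and then parses in one step. The function name parseAsciiOnly and the sample feed are hypothetical. The pattern used here keeps tabs and line breaks and drops everything else outside printable ASCII.

```php
<?php
// Hypothetical helper: strip non-ASCII bytes, then parse with SimpleXML.
// Returns false if the cleaned string still is not well-formed XML.
function parseAsciiOnly(string $xml)
{
    // Keep tab, LF, CR and printable ASCII; drop everything else.
    $clean = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $xml);
    return simplexml_load_string($clean);
}

// Made-up feed containing a lone Latin-1 "é" byte (0xE9).
$feed = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<jobs><job>Caf\xE9 manager</job></jobs>";
$doc = parseAsciiOnly($feed);

echo $doc->job, "\n"; // prints "Caf manager"
```

Note that the accented letter is simply gone from the output. That is acceptable for a quick fix, but it does lose data.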