Control Characters in XML: A conundrum

I’m working on a web service client which wraps up a bunch of character data and submits it for processing and storage.  Doesn’t sound too tricky, and for the most part, it isn’t.

Unfortunately, it appears that the application that produces the initial data allows characters that are not valid in XML (e.g. the null character, \u0000).  The consumer of this data expects to receive exactly the characters that were created, so stripping out the invalid ones is not an option.

Of course, the goal was to create a solution that would not require any special handling or decoding on the consumer side (and those of you who know more than I did last week about the XML character set restrictions are already seeing the problem…)

My initial stab at this was to write a simple filter that would find these characters and turn them into NCRs (Numeric Character References), so \u0000 would become &#0;, or &#x0; if I wanted to make things match the way I usually see NCRs written.  Hey, that’s a valid SGML NCR, so it should work, right?  I couldn’t imagine why all of the XMLStreamWriter implementations I found weren’t already doing just this.  In fact, I even found a post detailing a similar solution, and he said it worked great.
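The original filter isn’t reproduced here, but the idea fits in a few lines; this is only a sketch, and the class name and the choice of decimal references are placeholders rather than the actual implementation:

/**
 * Sketch of the pre-serialization filter: replace characters that XML cannot
 * carry (here, just the low control characters) with decimal NCR strings, so
 * that U+0000 becomes the four-character string "&#0;".
 */
public final class NcrFilter {

    public static String escapeControlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Tab, LF and CR are legal; everything else below 0x20 is not.
            // (The full XML 1.0 rules are a little wider; more on that below.)
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}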

This didn’t work; I kept getting parsing errors.  Maybe it was a side effect of the serialization being done by Axiom; either way, I changed the filter to embed the NCR in a CDATA section instead.  So now

this \u0000 is a test \u0001

became

this <![CDATA[&#0;]]> is a test <![CDATA[&#1;]]>

This actually seemed to work; at least it allowed the data to pass through the system as valid XML.
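The CDATA handling actually happened inside Axiom here, but the shape of the trick can be shown with a plain StAX writer; the snippet below is only an illustration of that shape, not the service’s actual code:

import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class CdataNcrDemo {
    public static void main(String[] args) throws Exception {
        StringWriter buffer = new StringWriter();
        XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);

        writer.writeStartElement("data");
        writer.writeCharacters("this ");
        writer.writeCData("&#0;");            // CDATA content goes out verbatim
        writer.writeCharacters(" is a test ");
        writer.writeCData("&#1;");
        writer.writeEndElement();
        writer.flush();

        // Should print: <data>this <![CDATA[&#0;]]> is a test <![CDATA[&#1;]]></data>
        System.out.println(buffer.toString());
    }
}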

Unfortunately, it really wasn’t fixing anything.  When the XML was serialized into a string for storage in the db, or as part of processing, the data would become

this &amp;#0; is a test &amp;#1;

This is perfectly valid XML, but it doesn’t satisfy the contract; the ampersand within my NCR has been escaped, so that the original character is “double-escaped”.  With this solution, the consumer would need to review the submitted XML for instances of such “double-escaped” sequences and decode them into the original characters.
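The effect is easy to reproduce in miniature with a plain StAX writer (again an illustration, not the actual Axiom code path): once the CDATA wrapper is lost, the NCR is ordinary character data and its ampersand gets escaped like any other.

import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class DoubleEscapeDemo {
    public static void main(String[] args) throws Exception {
        StringWriter buffer = new StringWriter();
        XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);

        writer.writeStartElement("data");
        // The NCR is now just text, so the writer escapes its ampersand.
        writer.writeCharacters("this &#0; is a test &#1;");
        writer.writeEndElement();
        writer.flush();

        // Should print: <data>this &amp;#0; is a test &amp;#1;</data>
        System.out.println(buffer.toString());
    }
}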

At this point, I did what I should have done last week, and checked whether XML had any Unicode character restrictions, and sure enough, it does.  Which means, bottom line, there will have to be some processing done by the consuming web service or other downstream processor.  The blog post I found must have had a more limited scope for the XML processing than what I needed.
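For reference, the restriction is the XML 1.0 “Char” production, which allows only tab, line feed, carriage return, and characters from U+0020 up (excluding the surrogate range and U+FFFE/U+FFFF).  Crucially, a character reference must also expand to one of these characters (the “Legal Character” well-formedness constraint), which is why writing &#0; never had a chance.  The check fits in one method:

public final class XmlChars {

    /**
     * The XML 1.0 "Char" production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    public static boolean isLegalXml10Char(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}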

So now I have a new question, and this is more of a design usability thing:  Which of the following replacement sequences would make more intuitive sense and/or be easier for the consumer to process?

  • Convert \u0000 to the string “\u0000”
  • Convert \u0000 to the string “&amp;#0;”

I think the first option looks a little cleaner for the non-XML-centric mind, but I’m wondering if the second option would be easier to process within existing code.  Fortunately, it’s not my decision to make; I’m passing this one up the project decision-making tree…
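Whichever way that decision goes, the consumer will end up doing a small substitution pass after the normal XML parsing.  Just to make the comparison concrete, here is a hypothetical sketch of the two decoders; the class and method names, and the decimal-only NCR handling, are my own simplifications rather than anything from the actual contract:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class PlaceholderDecoders {

    // Option 1: the data carries literal Java-style escapes for the character,
    // i.e. a backslash, a 'u', and four hex digits.
    private static final Pattern JAVA_STYLE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Option 2: the data carries NCR-style placeholders.  Depending on where in
    // the pipeline the decoding happens, an extra layer of XML escaping may sit
    // on top ("&amp;#0;" rather than "&#0;"), so the pattern tolerates both.
    private static final Pattern NCR_STYLE = Pattern.compile("&(?:amp;)?#(\\d+);");

    public static String decodeJavaStyle(String text) {
        return substitute(JAVA_STYLE, text, 16);
    }

    public static String decodeNcrStyle(String text) {
        return substitute(NCR_STYLE, text, 10);
    }

    private static String substitute(Pattern pattern, String text, int radix) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1), radix);
            m.appendReplacement(out,
                    Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(out);
        return out.toString();
    }
}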
