The XML specification lists a bunch of Unicode characters that are either illegal or "discouraged". Given a string, how can I remove all illegal characters from it?
I came up with the following regular expression, but it s a bit of a mouthful.
illegal_xml_re = re.compile(u [x00-x08x0b-x1fx7f-x84x86-x9fud800-udfffufdd0-ufddfufffe-uffff] )
clean = illegal_xml_re.sub( , dirty)
(Python 2.5 doesn t know about Unicode chars above 0xFFFF, so no need to filter those.)