MS Word splits words in its XML format

I have a Word 2003 document saved as XML in WordprocessingML format. It contains a few placeholders that will be dynamically replaced with the appropriate content. The problem is that Word, seemingly at random, splits them into separate pieces. For example, instead of this:

<w:t>${dl.d.out.ecs_rev}</w:t>

I have this:

...
  <w:t>${</w:t>
 </w:r>
 <w:r wsp:rsidR="005D11C0">
  <w:rPr>
   <w:sz w:val="20" />
   <w:sz-cs w:val="20" />
  </w:rPr>
  <w:t>dl.</w:t>
 </w:r>
 <w:r wsp:rsidRPr="00696324">
  <w:rPr>
   <w:sz w:val="20" />
   <w:sz-cs w:val="20" />
  </w:rPr>
  <w:t>d.out.ecs_rev}</w:t>
...

Is there any way to save a "clean" XML document using Word 2003, or is there any existing solution which can do the cleaning?

I tried to write a Java method that concatenates the separated parts of the placeholders, but because the number of possible split combinations is relatively large, the algorithm is far more complex than the original task I have to do, so it has become a problem in itself.

Answers

If you have control over the original Word documents, you can stop Word from writing rsid attributes and from flagging grammar and spelling errors, both of which cause text to be split into extra runs.

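         // Assumes: using Word = Microsoft.Office.Interop.Word;
         // and that wordApp is a running Word.Application instance.
         Word.Options opts = wordApp.Options;
         opts.CheckGrammarAsYouType = false;
         opts.CheckGrammarWithSpelling = false;
         opts.CheckSpellingAsYouType = false;
         opts.StoreRSIDOnSave = false;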

Words will still get split if, for example, you change the font partway through a word.

Hmmm, I have a simple but ugly bit of XSLT which I've used to clean WordML like the example you posted. I could commit it to docx4j if you want it but, as you say, there are various combinations which wouldn't be covered. Anyway, if you want it, please post to the docx4j forum.
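
For illustration only (this is not the XSLT mentioned above), the same kind of cleanup can be sketched in Java with the standard DOM API. The sketch assumes that merging is only safe for adjacent w:r elements that contain nothing but an optional w:rPr and a single w:t, and whose w:rPr blocks are identical; the attributes of the first run (such as the rsid values) are kept and those of the merged runs are dropped. The class and method names are made up for this example, and it shares the limitation already noted: runs that differ in formatting are left split.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RunMerger {
    // Namespace bound to the w: prefix in Word 2003 XML.
    static final String W = "http://schemas.microsoft.com/office/word/2003/wordml";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    static boolean isW(Node n, String localName) {
        return n instanceof Element && W.equals(n.getNamespaceURI())
                && localName.equals(n.getLocalName());
    }

    static Element child(Node parent, String localName) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isW(c, localName)) return (Element) c;
        }
        return null;
    }

    // Removes indentation-only text nodes so they do not sit between runs; w:t content is kept.
    static void stripWhitespace(Node node) {
        Node c = node.getFirstChild();
        while (c != null) {
            Node next = c.getNextSibling();
            if (c.getNodeType() == Node.TEXT_NODE) {
                if (!isW(node, "t") && c.getNodeValue().trim().isEmpty()) node.removeChild(c);
            } else {
                stripWhitespace(c);
            }
            c = next;
        }
    }

    // A run is "simple" if it holds exactly one w:t and otherwise only w:rPr.
    static boolean isSimpleRun(Node n) {
        if (!isW(n, "r")) return false;
        int texts = 0;
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isW(c, "t")) texts++;
            else if (!isW(c, "rPr")) return false;
        }
        return texts == 1;
    }

    static boolean sameFormatting(Node a, Node b) {
        Element pa = child(a, "rPr");
        Element pb = child(b, "rPr");
        return pa == null ? pb == null : pb != null && pa.isEqualNode(pb);
    }

    // Merges each simple run into the preceding simple run when both have equal w:rPr.
    static void mergeRuns(Document doc) {
        stripWhitespace(doc.getDocumentElement());
        NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Node cur = paragraphs.item(i).getFirstChild();
            while (cur != null) {
                Node next = cur.getNextSibling();
                if (isSimpleRun(cur) && isSimpleRun(next) && sameFormatting(cur, next)) {
                    Element t = child(cur, "t");
                    t.setTextContent(t.getTextContent() + child(next, "t").getTextContent());
                    cur.getParentNode().removeChild(next);
                } else {
                    cur = next;
                }
            }
        }
    }

    public static void main(String[] args) {
        String rPr = "<w:rPr><w:sz w:val=\"20\"/><w:sz-cs w:val=\"20\"/></w:rPr>";
        String xml = "<w:p xmlns:w=\"" + W + "\""
                + " xmlns:wsp=\"http://schemas.microsoft.com/office/word/2003/wordml/sp2\">"
                + "<w:r>" + rPr + "<w:t>${</w:t></w:r>"
                + "<w:r wsp:rsidR=\"005D11C0\">" + rPr + "<w:t>dl.</w:t></w:r>"
                + "<w:r wsp:rsidRPr=\"00696324\">" + rPr + "<w:t>d.out.ecs_rev}</w:t></w:r>"
                + "</w:p>";
        Document doc = parse(xml);
        mergeRuns(doc);
        // prints ${dl.d.out.ecs_rev}
        System.out.println(doc.getElementsByTagNameNS(W, "t").item(0).getTextContent());
    }
}
```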

A more robust approach would be to extract the plain text, and relate the plain text to the XML, so you can search the plain text, and go from there to the XML.
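
A minimal sketch of that idea in Java, using only the standard DOM and regex APIs: the w:t texts of one paragraph are concatenated into a plain string while the start offset of each w:t is recorded, placeholders are found in the plain string, and each match is mapped back to the w:t elements it covers. The class name, the method names and the ${...} pattern are assumptions for this example. The replacement value is written into the w:t where the placeholder starts, so it takes on that run's formatting, and the other covered w:t elements are trimmed or left empty.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PlaceholderFiller {
    // Namespace bound to the w: prefix in Word 2003 XML.
    static final String W = "http://schemas.microsoft.com/office/word/2003/wordml";
    // Matches ${name}; group 1 is the name between the braces.
    static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Replaces every placeholder in one paragraph, even if it spans several w:t elements.
    static void fillParagraph(Element paragraph, UnaryOperator<String> valueFor) {
        NodeList nodes = paragraph.getElementsByTagNameNS(W, "t");
        int count = nodes.getLength();
        int[] start = new int[count + 1];          // start[i] = offset of w:t number i in the plain text
        StringBuilder plain = new StringBuilder();
        for (int i = 0; i < count; i++) {
            start[i] = plain.length();
            plain.append(nodes.item(i).getTextContent());
        }
        start[count] = plain.length();

        // Search the plain text, remembering where each placeholder starts and ends.
        List<int[]> spans = new ArrayList<>();
        List<String> values = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(plain);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
            values.add(valueFor.apply(m.group(1)));
        }

        // Apply from last to first so that offsets of earlier placeholders stay valid.
        for (int k = spans.size() - 1; k >= 0; k--) {
            int from = spans.get(k)[0];
            int to = spans.get(k)[1];
            for (int i = 0; i < count; i++) {
                int s = start[i];
                int e = start[i + 1];
                if (e <= from || s >= to) continue;   // this w:t is outside the placeholder
                Node t = nodes.item(i);
                String cur = t.getTextContent();
                // The w:t where the placeholder begins keeps its prefix and receives the value.
                String head = from >= s ? cur.substring(0, from - s) + values.get(k) : "";
                // The w:t where the placeholder ends keeps whatever follows it.
                String tail = to <= e ? cur.substring(to - s) : "";
                t.setTextContent(head + tail);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W + "\">"
                + "<w:r><w:t>Rev: ${</w:t></w:r>"
                + "<w:r><w:rPr><w:b/></w:rPr><w:t>dl.</w:t></w:r>"
                + "<w:r><w:t>d.out.ecs_rev} ok</w:t></w:r>"
                + "</w:p>";
        Document doc = parse(xml);
        fillParagraph(doc.getDocumentElement(), name -> "42");
        // prints Rev: 42 ok
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Unlike merging runs, this does not depend on the split pieces sharing the same formatting, because the XML structure is left alone and only the text inside the affected w:t elements changes.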

Word 2003 XML is unusually complex and hard to decode. You are getting multiple tags because WordML splits text into elements called runs (the w:r tag). As far as I know, there is no easy way to clean the XML above. I would recommend using HTML instead of WordML; it is much easier to manipulate when replacing your placeholders with the appropriate content. If cost is not an issue, use a product like Aspose, which does everything for you and is simple to use.




