12.9 Normalizing an XML Document
Credit: David Ascher, Paul Prescod
12.9.1 Problem
You want to compare two
different XML documents using standard tools such as
diff.
12.9.2 Solution
Normalize each XML document using the following recipe, then use a
whitespace-insensitive diff tool:
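from xml.dom import minidom
# inputfname and outputfname are the paths of the source and normalized files
dom = minidom.parse(inputfname)
with open(outputfname, "w") as out:
    dom.writexml(out)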
12.9.3 Discussion
Different editing tools munge XML
differently. Some, like text editors, make no modification that is
not explicitly done by the user. Others, such as XML-specific
editors, sometimes change the order of attributes or automatically
indent elements to facilitate the reading of raw XML. There are
reasons for each approach, but unfortunately, the two approaches can
lead to confusing differences—for example, if one author uses a
plain editor while another uses a fancy XML editor, and a third
person is in charge of merging the two sets of changes. In such
cases, one should use an XML-difference engine. Typically, however,
such tools are not easy to come by. Most are written in Java and
don't deal well with large XML documents (performing
tree-diffs efficiently is a hard problem!).
Luckily, combinations of small steps can solve the problem nicely.
First, normalize each XML document, then use a standard line-oriented
diff tool to compare the normalized outputs.
This recipe is a simple XML normalizer. All it does is parse the XML
into a Document Object Model (DOM) and write it out. In the process,
elements with no children are written in the more compact form
(<foo/> rather than <foo></foo>), and attributes are sorted
lexicographically (as of Python 3.8, writexml instead preserves the
order in which the attributes appear in the source).
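The parse-and-rewrite step can be tried on strings rather than files. The following is a minimal sketch, not part of the original recipe: the normalize helper is a hypothetical name, and the example sticks to the empty-element behavior because attribute ordering depends on the Python version in use.

```python
from xml.dom import minidom

def normalize(xml_text):
    """Parse xml_text into a DOM and serialize it again (hypothetical helper)."""
    dom = minidom.parseString(xml_text)
    try:
        # Serialize only the root element, leaving out the XML declaration.
        return dom.documentElement.toxml()
    finally:
        dom.unlink()  # break reference cycles inside the DOM tree

# Two spellings of the same document: empty elements written both ways.
a = normalize('<doc><foo></foo><bar id="1"/></doc>')
b = normalize('<doc><foo/><bar id="1"></bar></doc>')

print(a)       # <doc><foo/><bar id="1"/></doc>
print(a == b)  # True
```

Both inputs come out in the same compact form, so a line-oriented comparison no longer reports the difference in how the empty elements were written.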
The second stage is easily done by using some options to the standard
diff, such as the -w option,
which ignores whitespace differences. Or you might want to use
Python's standard module difflib, whose ndiff function by default
treats spaces and tabs as ignorable junk when it aligns characters
within changed lines, and which has the advantage of being available
on all platforms since Python 2.1.
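As a rough illustration of the second stage with difflib, the sketch below collapses whitespace on each line before handing the lines to difflib.unified_diff. The helpers squeeze and diff_ignoring_whitespace are hypothetical names, and collapsing whitespace this way only approximates what diff -w does.

```python
import difflib

def squeeze(line):
    """Collapse runs of whitespace so indentation changes do not show up."""
    return " ".join(line.split())

def diff_ignoring_whitespace(old_text, new_text):
    """Return unified-diff lines for two texts, ignoring whitespace changes."""
    old_lines = [squeeze(line) for line in old_text.splitlines()]
    new_lines = [squeeze(line) for line in new_text.splitlines()]
    return list(difflib.unified_diff(old_lines, new_lines,
                                     "old", "new", lineterm=""))

# The second text reindents one element with a tab and adds a new element.
old = "<doc>\n  <item>one</item>\n</doc>"
new = "<doc>\n\t<item>one</item>\n<item>two</item>\n</doc>"

for line in diff_ignoring_whitespace(old, new):
    print(line)
```

Only the added element is reported as a change; the reindented line is treated as unchanged.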
There's a slight problem that shows up if you use
this recipe unaltered. The standard way in which
minidom escapes quotation marks on output
results in every " inside an element's text
appearing as &quot;. This
won't make a difference to smart XML editors, but
it's not a nice thing to do for people reading the
output with vi or emacs.
Luckily, fixing minidom from the outside
isn't hard:
def _write_data(writer, data):
    "Writes datachars to writer, leaving quotation marks unescaped."
    # Escape & first so the entities produced below are not escaped again.
    data = data.replace("&", "&amp;")
    data = data.replace("<", "&lt;")
    data = data.replace(">", "&gt;")
    writer.write(data)

def my_writexml(self, writer, indent="", addindent="", newl=""):
    _write_data(writer, "%s%s%s" % (indent, self.data, newl))

minidom.Text.writexml = my_writexml
Here, we substitute the writexml method for
Text nodes with a version that calls a new
_write_data function
identical to the one in minidom, except that the
escaping of quotation marks is skipped. Naturally, the preceding
should be done before the call to minidom.parse to
be effective.
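To see the effect end to end, here is a small self-contained sketch that installs a replacement of this kind and serializes a document whose text contains quotation marks. The sample document is invented for illustration, and some Python releases may already leave quotation marks in text nodes unescaped, in which case the patch changes nothing.

```python
from xml.dom import minidom

def _write_data(writer, data):
    """Escape &, < and > but leave quotation marks alone."""
    # & must be handled first so the other entities are not escaped twice.
    data = data.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    writer.write(data)

def my_writexml(self, writer, indent="", addindent="", newl=""):
    _write_data(writer, "%s%s%s" % (indent, self.data, newl))

# Install the patch up front, before any parsing or writing takes place.
minidom.Text.writexml = my_writexml

dom = minidom.parseString('<p>She said "hi" &amp; left</p>')
result = dom.documentElement.toxml()
print(result)  # <p>She said "hi" &amp; left</p>
```

The quotation marks survive as typed, while the ampersand is still written as an entity so the output remains well-formed XML.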
12.9.4 See Also
Documentation for minidom is part of the XML
documentation in the Standard Library reference.