Monday, June 14, 2010

W3 Document.textContent vs. MSXML Document.text and MSDN docs

The W3C DOM specification says that .textContent for a DOCUMENT_NODE should be null.
The closest Microsoft implementation, MSXML's .text, is documented on MSDN as follows:
NODE_DOCUMENT
Returns a string representing the value of the node. This is the concatenated text of all subnodes with entities expanded.
But what counts as a subnode, and what counts as text? Take this sample:
<?xml version="1.0" encoding="utf-8" ?>
<!-- Document level comment -->
<!-- TODO: NOTATION -->
<!DOCTYPE root [
    <!ENTITY ent1 "expanded ent1">
]>
<?pi1 ?>
<root attribute="attribute.value">
    element.text.1
    <e1><![CDATA[cdata.content]]></e1>
    <e2><!--comment.content--></e2>
    <e3>&ent1;</e3>
    element.text.2
</root> 
The Remarks section clarifies some of this:
When concatenated, the text represents the contents of text or CDATA nodes. All concatenated text nodes are normalized according to xml:space attributes and the value of the preserveWhiteSpace switch. Concatenated CDATA text is not normalized. (Child nodes that contain NODE_COMMENT and NODE_PROCESSING_INSTRUCTION nodes are not concatenated.) .text trims the whitespace on the edges of the result, and "normalizes" \r\n => \n, but otherwise just concatenates text.
Retrieves and sets the string representing the text contents of this node or the concatenated text representing this node and its descendants.
For more precise control over text manipulation in an XML document, use the lower-level nodeValue property, which returns the raw text associated with a NODE_TEXT node.
For this sample, Document.text returns:
element.text.1 cdata.content expanded ent1 element.text.2
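As a rough cross-check outside MSXML, the rules quoted from the Remarks section can be approximated with Python's standard xml.etree.ElementTree. This is only a sketch under assumptions: ElementTree is not MSXML, the variable names are mine, and collapsing every run of whitespace to a single space is a simplification of MSXML's normalization.

```python
import xml.etree.ElementTree as ET

# The sample document from above, passed as bytes so the encoding
# declaration in the XML prolog is accepted as-is.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" ?>
<!-- Document level comment -->
<!-- TODO: NOTATION -->
<!DOCTYPE root [
    <!ENTITY ent1 "expanded ent1">
]>
<?pi1 ?>
<root attribute="attribute.value">
    element.text.1
    <e1><![CDATA[cdata.content]]></e1>
    <e2><!--comment.content--></e2>
    <e3>&ent1;</e3>
    element.text.2
</root>
"""

# The default parser drops comments and processing instructions and
# expands the internal entity, so only text and CDATA content remain.
root = ET.fromstring(SAMPLE)

# Concatenate all descendant text, then trim and collapse whitespace.
concatenated = " ".join("".join(root.itertext()).split())
print(concatenated)
# element.text.1 cdata.content expanded ent1 element.text.2
```

The output matches what MSXML returned for the sample, which suggests that the documented rule amounts to "text and CDATA of the element tree, nothing else".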
The comments are skipped, which is fine, but I still miss the text of my NODE_ENTITY.
If requested directly, NODE_ENTITY.text returns:
expanded ent1
So I would expect:
expanded ent1 element.text.1 cdata.content expanded ent1 element.text.2
Why is NODE_ENTITY.text missing from NODE_DOCUMENT.text? Maybe because it is inside NODE_DOCUMENT_TYPE, which claims to return .text as ""? Or because "text" does not mean text but nodeValue, which is defined as null for both NODE_DOCUMENT_TYPE and NODE_ENTITY?
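The nodeValue guess can at least be checked against a W3C-style DOM using Python's xml.dom.minidom. This is a sketch, assuming minidom is a fair stand-in for the specification; it says nothing definite about MSXML internals. The shortened sample and the variable names are mine.

```python
from xml.dom import minidom

# Shortened version of the sample: one internal entity, one reference.
SAMPLE = b"""<!DOCTYPE root [
    <!ENTITY ent1 "expanded ent1">
]>
<root><e3>&ent1;</e3></root>
"""

doc = minidom.parseString(SAMPLE)
doctype = doc.doctype
entity = doctype.entities.getNamedItem("ent1")

# nodeValue is None (null) for the document, the doctype and the entity.
print(doc.nodeValue, doctype.nodeValue, entity.nodeValue)

# The entity is reachable through the doctype's entities map,
# but the doctype itself has no child nodes at all.
print(entity.nodeName, len(doctype.childNodes))
```

So in this DOM the entity declaration is neither a child of the doctype nor a carrier of a nodeValue, which is compatible with both guesses.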

Results:
From my quick tests, Document.text behaves the same as Document.documentElement.text. If anyone can show how they may differ, I would be pleased. Until then, I consider this bad design, a useless deviation from the W3C, and insufficient documentation.
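One possible explanation, sketched here with Python's xml.dom.minidom rather than MSXML: in a W3C-style DOM, entity declarations are not child nodes of the document type, so a plain descendant walk that gathers text and CDATA nodes never reaches them, and the document yields the same string as its root element. The helper concat_text is hypothetical, and whether MSXML works this way internally is an assumption.

```python
from xml.dom import minidom, Node

SAMPLE = b"""<?xml version="1.0" encoding="utf-8" ?>
<!-- Document level comment -->
<!-- TODO: NOTATION -->
<!DOCTYPE root [
    <!ENTITY ent1 "expanded ent1">
]>
<?pi1 ?>
<root attribute="attribute.value">
    element.text.1
    <e1><![CDATA[cdata.content]]></e1>
    <e2><!--comment.content--></e2>
    <e3>&ent1;</e3>
    element.text.2
</root>
"""


def concat_text(node):
    """Join text and CDATA data of all descendants, collapsing whitespace."""
    parts = []

    def walk(current):
        for child in current.childNodes:
            if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
                parts.append(child.data)
            else:
                # Comments, PIs and the doctype have no children,
                # so descending into them contributes nothing.
                walk(child)

    walk(node)
    return " ".join("".join(parts).split())


doc = minidom.parseString(SAMPLE)
from_document = concat_text(doc)
from_root = concat_text(doc.documentElement)
print(from_document == from_root)
```

Under these assumptions the two strings can only differ if text nodes exist outside the root element, which well-formed XML does not allow.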
