Skip to content Skip to sidebar Skip to footer

Best Way To "fix" Malformed Html For Use In An Xsl Transform

I have an input xml document that contains mal-formed html which has been xml encoded. i.e. the xml document itself is technically valid. Now I am applying an xsl transform to the

Solution 1:

Is using a third party library acceptable? The HTML Agility Pack (available on NuGet) might got part of the way to solving your invalid HTML and it also (according to the website) supports XSLT.


Solution 2:

I went for using a sgml parsing library and converting to valid xml.

I went for Mind Touch's library: https://github.com/MindTouch/SGMLReader

Once compiled and added to the GAC I could use this xsl:

<msxsl:script language="C#" implements-prefix="myns">
  <msxsl:assembly name="SgmlReaderDll, Version=1.8.11.0, Culture=neutral, PublicKeyToken=46b2db9ca481831b"/>
    <![CDATA[
 public XPathNodeIterator SGMLStringToXml(string strSGML)
 {
 Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
 sgmlReader.DocType = "HTML";
 sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
 sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
 sgmlReader.InputStream = new System.IO.StringReader(strSGML);

 // create document
 XmlDocument doc = new XmlDocument();
 doc.PreserveWhitespace = true;
 doc.XmlResolver = null;
 doc.Load(sgmlReader);
 return doc.CreateNavigator().Select("/*");
 }

 public string CurDir()
 {
 return (new System.IO.DirectoryInfo(".")).FullName;
 }
  ]]>

</msxsl:script>
<xsl:template match="node()" mode="PreventSelfClosingTags">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
    <xsl:text> </xsl:text>
  </xsl:copy>
</xsl:template>
<xsl:template match="@*" mode="PreventSelfClosingTags">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

and use it like so:

<xsl:apply-templates select="myns:SGMLStringToXml(.)/body/*" mode="PreventSelfClosingTags"/>

N.B. You have to run the transform manually with an XslCompiledTransform instance. The asp:xml control doesn't like the DLL reference.


Post a Comment for "Best Way To "fix" Malformed Html For Use In An Xsl Transform"