
How Do HTML Parsers Process Text Outside Elements (Text Nodes)

Reference question: "Add html tag to string in PHP". The questioner asks how to properly detect untagged text in an HTML file (he wanted to insert tags as needed). He provided this example…

Solution 1:

I don't suppose anyone else will post a reply, so for the record I am writing down here what I learned from the comments and sound advice of sideshowbarker.

What does the latest HTML5 standard say about untagged text and how it should be treated?

Untagged text is entered into the DOM as a text node. The text node is inserted as a child node of the element in which it appears. For example, in this snippet:

<body>
    <h2><b>Hello World</b></h2>
    <p>First</p>
    Second
    <p>Third</p>
</body>

... "Second" is part of a text node (nodeType=3) which is a child node of the body element.

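In fact, the body element has four child text nodes; the nodeValue of each is shown in the list below. (The list writes line breaks as "CR-LF" as they appear in the source file; a parser that follows the HTML5 standard normalizes each CR-LF to a single LF before building the nodes.)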

  1. "CR-LF " after the opening body tag.
  2. "CR-LF " after the <h2><b>Hello World</b></h2> element
  3. "CR-LF Second-CR-LF " after the <p>First</p> element
  4. "CR-LF " after the <p>Third</p> element

Most "uglifiers" (minifiers) will probably strip the CR-LF and spaces from these text nodes, which in most cases removes the whitespace-only nodes altogether.
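The four text nodes can be checked with a small script. The following is only a sketch using Python's standard-library `html.parser` (not the PHP parser discussed in this answer); the class name `BodyTextNodes` and the open-tag stack are illustrative assumptions, and the snippet is written with LF line endings. `html.parser` is an event-based tokenizer, so the "children of body" are approximated by recording text that arrives while `body` is the innermost open tag.

```python
from html.parser import HTMLParser

# The snippet from above, written with LF line endings.
SNIPPET = (
    "<body>\n"
    "    <h2><b>Hello World</b></h2>\n"
    "    <p>First</p>\n"
    "    Second\n"
    "    <p>Third</p>\n"
    "</body>"
)


class BodyTextNodes(HTMLParser):
    """Hypothetical helper: collects text that is a direct child of <body>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.text_nodes = []  # raw text found directly inside <body>

    def handle_starttag(self, tag, attrs):
        # Assumption: the input has no void elements (<br>, <img>, ...),
        # so every start tag has a matching end tag.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Text counts as a child of <body> only if <body> is innermost.
        if self.open_tags and self.open_tags[-1] == "body":
            self.text_nodes.append(data)


collector = BodyTextNodes()
collector.feed(SNIPPET)
collector.close()

for value in collector.text_nodes:
    print(repr(value))
```

The script prints four values; three contain only whitespace and the third one contains "Second" surrounded by whitespace.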

How do current HTML parsers treat untagged text?

As above, but with at least these qualifiers:

  1. Untagged text (be it formatting, alphanumeric, or both) between the <html> tags but outside the <body> tags will be moved inside the <body> element.
  2. If the <body> tags are missing, the parser will insert them.

For example, using DOMDocument (PHP's built-in DOM parser), this input:

<html>
    text before body
<body><h2><b>Hello World</b></h2><p>First</p>
    Second
    <p>Third</p>
    fourth
    <p>Third</p><!-- comment --></body>
    text after body
</html>

produced this DOM (untagged text moved into the <body> element):

<html><body><p>
    text before body
</p><h2><b>Hello World</b></h2><p>First</p>
    Second
    <p>Third</p>
    fourth
    <p>Third</p><!-- comment -->

    text after body
</body></html>

And this input:

<html><h2><b>Hello World</b></h2><p>First</p>
    Second
    <p>Third</p>
    fourth
    <p>Third</p><!-- comment --></html>

produced this DOM (<body> tags inserted by the parser):

<html><body><h2><b>Hello World</b></h2><p>First</p>
    Second
    <p>Third</p>
    fourth
    <p>Third</p><!-- comment --></body></html>

Could the problem in question SO52159323 have been solved using an HTML parser class (in whatever language)? That is, by running the text past the parser and expecting the parser to identify the untagged text and its location?

Yes. See the code fragment in my answer at "Add html tag to string in PHP". The parser produces the DOM, which makes it possible to search out the candidate nodes and do the required processing.
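To illustrate the "in whatever language" part, here is a rough sketch in Python using the standard-library `html.parser`. It is not the PHP fragment from the linked answer, and the names `UntaggedTextLocator`, `CONTAINERS` and `VOID_ELEMENTS` are made up for this example. Unlike a DOM parser, `html.parser` only reports events and does not move text into the <body> element, so the sketch treats non-whitespace text sitting directly inside <html> or <body> as untagged and reports the source line it was found on.

```python
from html.parser import HTMLParser

# The first sample input from above, with LF line endings.
SOURCE = """<html>
    text before body
<body><h2><b>Hello World</b></h2><p>First</p>
    Second
    <p>Third</p>
    fourth
    <p>Third</p><!-- comment --></body>
    text after body
</html>"""

# Assumption: text directly inside these elements counts as "untagged".
CONTAINERS = {"html", "body"}
# Elements that never take an end tag, so they are not pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class UntaggedTextLocator(HTMLParser):
    """Hypothetical helper: records (line, text) for untagged text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not data.strip():
            return  # formatting-only whitespace
        if self.open_tags and self.open_tags[-1] in CONTAINERS:
            # getpos() gives the line where this text chunk starts;
            # skip the line breaks that precede the first real character.
            start_line, _ = self.getpos()
            leading = data[: len(data) - len(data.lstrip())]
            self.found.append((start_line + leading.count("\n"), data.strip()))


locator = UntaggedTextLocator()
locator.feed(SOURCE)
locator.close()

for line, text in locator.found:
    print(f"line {line}: {text!r}")
```

Each reported entry is a candidate for wrapping in a tag. A DOM-based approach would do the same by walking the child nodes of the body element and checking for text nodes (nodeType=3) with non-whitespace content.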
