
Parse HTML User Input

Let's say I have a string from the user ($input). I can strip tags so that only allowed tags get through. I can convert it to plain text with htmlspecialchars(). I can even replace all tags
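For reference, here is a minimal sketch of the first two options mentioned above, using PHP's built-in strip_tags() and htmlspecialchars(). The sample string and the allow-list of tags are assumptions made up for illustration; they are not from the question.

```php
<?php
// Hypothetical sample input; any user-supplied string would do.
$input = 'Hello <em>world</em> <script>alert(1)</script> & co.';

// Option 1: keep only allow-listed tags. strip_tags() removes the
// disallowed tags themselves but keeps the text between them, and it
// does not filter attributes on the tags it keeps.
$stripped = strip_tags($input, '<em><code>');

// Option 2: treat the whole string as text by escaping HTML special characters.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n"; // Hello <em>world</em> alert(1) & co.
echo $escaped, "\n";  // Hello &lt;em&gt;world&lt;/em&gt; &lt;script&gt;alert(1)&lt;/script&gt; &amp; co.
```

Neither option checks that the surviving tags are properly nested or closed, which is the gap a hand-written parser tries to fill.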

Solution 1:

This is what I eventually used:

function html($input) {
    // Escape stray "<" and "&" characters that cannot start a tag or an entity
    $input = preg_replace(
        ['#&([^A-Za-z])#', '#<([^A-Za-z/])#', '#&$#', '#<$#'],
        ['&amp;$1', '&lt;$1', '&amp;', '&lt;'],
        $input
    );
    $open = [];     // Stack of currently open tags
    $close = false; // Is the current tag a close tag?
    $tag = false;   // Are we inside a tag name?
    for ($i = 0; $i < strlen($input); $i++) { // strlen() is re-read because $input changes
        if ($tag) { // Are we in a tag?
            if (preg_match("/[^a-z]/", $input[$i])) { // The tag name has ended
                if ($close) {
                    $close = false;
                    $sPos = strrpos(substr($input, 0, $i), '<') + 2; // Start position of tag name
                    $tag = substr($input, $sPos, $i - $sPos);        // Tag name
                    if (end($open) == $tag) {
                        array_pop($open); // Good, it's a valid XML closing
                    } else {
                        // BAD! Convert tag to text (open tag will be handled later)
                        $input = substr($input, 0, $sPos - 2) . '&lt;/' . $tag . substr($input, $i);
                    }
                } else {
                    $sPos = strrpos(substr($input, 0, $i), '<') + 1; // Start position of tag name
                    $tag = substr($input, $sPos, $i - $sPos);        // Tag name
                    if (in_array($tag, ['em','i','del','sub','sup','sml','code','kbd','pre','codebl','bl','sbl'])) { // Is it an acceptable tag?
                        array_push($open, $tag); // Add it to the stack
                        $j = $i + 1;
                        while (preg_match("/\s/", $input[$j] ?? '')) { // Get rid of whitespace
                            $j++;
                        }
                        $input = substr($input, 0, $sPos - 1) . '<' . $tag . '>' . substr($input, $j); // Seems legit
                    } else {
                        // BAD! Convert tag to text
                        $input = substr($input, 0, $sPos - 1) . '&lt;' . $tag . substr($input, $i);
                    }
                }
                $tag = false;
            }
        } elseif (!in_array('code', $open) && !in_array('codebl', $open) && !in_array('pre', $open)) { // Standard parsing of text
            if ($input[$i] == '<') { // Is it a tag?
                $tag = true;
                if (($input[$i + 1] ?? '') == '/') { // Is it a close tag?
                    $i++;
                    $close = true;
                }
            } elseif (substr($input, $i, 4) == 'http') { // Link
                if (preg_match('#^.{'.$i.'}(https?):\/\/([^\s"\(\)<>]+)#', $input, $m)) {
                    $insert = '<a href="'.$m[1].'://'.$m[2].'" target="_blank">'.$m[2].'</a>';
                    $input = substr($input, 0, $i) . $insert . substr($input, $i + strlen($m[1].'://'.$m[2]));
                    $i += strlen($insert);
                }
            } elseif ($input[$i] == "\n" && ($input[$i + 1] ?? '') == "\n") { // Insert <bl> tag? (I use this to separate sections of text)
                $input = substr($input, 0, $i + 1) . '</bl><bl>' . substr($input, $i + 1);
            }
        } else { // We're in a code tag
            if (substr($input, $i + 1, strlen(end($open)) + 3) == '</'.current($open).'>') {
                array_pop($open);
                $i += 2;
            } elseif ($input[$i] == '<') {
                $input = substr($input, 0, $i) . '&lt;' . substr($input, $i + 1);
                $i += 3; // Code tags have raw text
            } elseif (in_array('code', $open) && $input[$i] == "\n") { // No linebreaks are allowed in inline tags, convert to <codebl>
                $open[count($open) - 1] = 'codebl';
                $start = strrpos($input, '<code>');                  // Last opening <code>
                $relEnd = strpos(substr($input, $start), '</code>'); // Offset of </code> relative to $start
                $input = substr($input, 0, $start) . '<codebl>' . substr($input, $start + 6, $relEnd - 6) . '</codebl>' . substr($input, $relEnd + $start + 7);
                $i += 4;
            }
        }
    }
    while ($open) { // Close any tags that are still open
        $input .= '</'.end($open).'>';
        array_pop($open);
    }
    return '<bl>'.$input.'</bl>';
}
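For comparison, the allow-list idea behind the function above can be sketched much more compactly: escape everything first, then restore only bare allow-listed tags. This is a different, simpler technique and not the author's method. The helper name allowTags() and its parameters are hypothetical, and unlike the function above it does not balance tags, linkify URLs, or treat code blocks specially.

```php
<?php
// Hypothetical helper: escape the whole string, then turn escaped
// "<tag>" / "</tag>" back into real tags for allow-listed names only.
// Tags carrying attributes are never restored, so they stay as text.
function allowTags(string $input, array $allowed): string
{
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    if ($allowed === []) {
        return $escaped; // Nothing to restore
    }
    // Build an alternation such as "em|code" from the allow-list.
    $names = implode('|', array_map(fn($t) => preg_quote($t, '#'), $allowed));
    return preg_replace('#&lt;(/?)(' . $names . ')&gt;#', '<$1$2>', $escaped);
}

echo allowTags('<em>hi</em> <script>x</script>', ['em', 'code']), "\n";
// <em>hi</em> &lt;script&gt;x&lt;/script&gt;
```

Because everything is escaped before anything is restored, a tag that is not on the list, or that has attributes, can only ever appear as visible text.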

I know it's a bit riskier, but you can first assume the input is good and then filter out whatever is explicitly found to be bad.
