How Do I Query A Database Field But Ignore The Html Markup?

September 30, 2023 Post a Comment

We have a field that contains HTML markup for formatting on the website, but we need to query just the text that should render on screen, not things like CSS tags, tag names, prope

Solution 1:

Here's a User Defined Function for just that which you could leverage:

http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

Solution 2:

Processing like this should not be done in the database. I would recommend creating a separate field containing only the text contents.

In response to @Nissan Fans comment: Extracting text from HTML is not the database's job IMO. It's too complex a job for it, and it has too many variables. I'm not well versed in reading stored procedures but if I read the code correctly, it will have problems with an (invalid but still often occuring) unencoded < in the source code. And it will most likely break for invalid HTML.

Or imagine one day, the customer comes and wants img elements' ALT properties indexed too. Or titles. Start building that with a "start position, end position" algorithm. You will go crazy.

I say, if this is needed to process HTML from varying sources outside your control on a day-to-day basis, leave this to a layer above the DB that is better equipped to handle this stuff. A DOM based approach - perhaps using BeautifulSoup to be able to deal with invalid HTML - parsing out all nodeValues would be the most reliable thing.

Maybe this is overkill, and the stored procedure will work just fine in the OP's case - it looks like it from his comment, and that's perfectly all right. I'm just saying, if you can't control the incoming HTML, don't strip HTML with the limited means the database offers for the job.

Solution 3:

I agree with Pekka's; this isn't something that your database should be dealing with.

Cons to doing this parsing in the DB:

Performance issues. Using UDFs can degrade performance and lead to table scans. And even if you avoid table scans, you're still asking the DB to do a bunch of stuff (string manipulation) that it wasn't designed to do.
Harder to get right. Correctly parsing HTML is a tough job. True you can get 95% of the way there with a UDF, but handling this in the application layer might get you 100% of the way there.
Harder to test. I'd much rather write unit tests for HTML stripping code that execute in C# against string literals, rather than having to round-trip to the DB.

If you must do this in the DB...

If doing this in the DB is a requirement, consider this approach:

Add a second field to your DB to hold the plain text version of the contents.
Add a trigger so that each time the HTML value is changed, the text version is regenerated.
Write your queries against the plain text field.

You'll get better performance because you're only doing the parsing at write time, rather than on every search, and your DB will make better use of any indexes you define on the plain text field.

Solution 4:

If you can run regular expressions in your query, you can strip out the HTML and return only the text using the examples here: http://www.regular-expressions.info/examples.html

Solution 5:

If you attempt to index one of these columns and access it by removing the html:

WHERE dbo.anyRemoveHtml(yourColumn)='your search text'

the index will not be used and you will table scan. this might not be a problem when the application has little data, but will result in slower and slower SELECTs as more data is added to the table.

note: dbo.anyRemoveHtml is just a made up name representing the function that you select to remove the HTML, and does not really exist

Html5 Library