How to strip all HTML tags and entities and get plain text?

I was encouraged to write this Tip/Trick because I have received so many questions about this issue.
Suppose you have a bunch of HTML strings and you just want to remove all the HTML tags and keep the plain text.

Regex can come to the rescue.

The Regex I developed originally was more cumbersome. Chris then suggested a simpler one, "\<[^\>]*\>", and that is the Regex I will use from here on.

I have tested it against many cases, and it detected every kind of HTML tag I tried. There may still be loopholes, so if you find a tag that slips past this Regex, kindly let me know.
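To make the idea of a loophole concrete, here is a minimal sketch of two inputs where a tag-matching pattern like this one is known to fall short. The class and method names (LoopholeDemo, AttributeCase, ScriptCase) are made up for illustration and are not part of the original tip.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are assumptions for illustration only.
public static class LoopholeDemo
{
    // Same pattern as in the tip, written without the optional backslashes.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    // A ">" inside a quoted attribute value ends the match too early,
    // so the tail of the tag is left behind in the output.
    public static string AttributeCase()
    {
        return StripTags("<a title=\"a > b\">link</a>");
    }

    // The <script> tags themselves are removed, but the script source
    // between them is ordinary text to the Regex and stays in the output.
    public static string ScriptCase()
    {
        return StripTags("<script>var x = 1;</script>Hi");
    }
}
```

AttributeCase() returns ` b">link` instead of `link`, and ScriptCase() returns `var x = 1;Hi`. If your input can contain cases like these, a real HTML parser is likely the safer choice.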

Regex Definition

  • Regex: \<[^\>]*\>
    • \< is a literal <
    • [^\>]* is any character that is NOT in the class [\>] (that is, anything except >), with any number of repetitions
    • \> is a literal >
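As a quick check of this breakdown, the sketch below runs the pattern on a tag that spans two lines and compares it with the lazy pattern "<.*?>". In .NET the dot does not match a line break unless RegexOptions.Singleline is set, whereas the negated class [^\>] does. PatternDemo and its members are hypothetical names used only for this illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are assumptions for illustration only.
public static class PatternDemo
{
    // An opening tag whose attributes continue on a second line.
    private const string Sample = "<p\nclass=\"x\">Hello</p>";

    // Negated character class: "anything except >" includes the line break.
    public static string WithNegatedClass()
    {
        return Regex.Replace(Sample, @"\<[^\>]*\>", string.Empty);
    }

    // Lazy dot: "." stops at the line break, so the opening tag is not matched.
    public static string WithLazyDot()
    {
        return Regex.Replace(Sample, "<.*?>", string.Empty);
    }

    // Lazy dot with Singleline: "." may now match the line break as well.
    public static string WithLazyDotSingleline()
    {
        return Regex.Replace(Sample, "<.*?>", string.Empty, RegexOptions.Singleline);
    }
}
```

WithNegatedClass() and WithLazyDotSingleline() both return `Hello`, while WithLazyDot() leaves the two-line opening tag in place and only removes `</p>`.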

Visual Basic

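Imports System.Text.RegularExpressions

''' <summary>
''' Remove HTML tags from a string with Regex
''' </summary>
Function StripTags(ByVal html As String) As String
    ' Match "<", then anything except ">", then ">", and replace it with nothing.
    Return Regex.Replace(html, "<[^>]*>", "")
End Function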

C#

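using System.Text.RegularExpressions;

/// <summary>
/// Remove HTML tags from a string with Regex
/// </summary>
public static string StripTags(string source)
{
    // Match "<", then anything except ">", then ">", and replace it with nothing.
    return Regex.Replace(source, "<[^>]*>", string.Empty);
}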
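Stripping tags still leaves entities such as &amp;amp; or &amp;lt; in the text. One possible way to handle them, sketched here as an assumption rather than as part of the original tip, is to pass the stripped text through WebUtility.HtmlDecode from System.Net. The class and method names (HtmlText, ToPlainText) are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are assumptions for illustration only.
public static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // Strip the tags first, so that an encoded "&lt;b&gt;" is still an
        // entity at this point and is not mistaken for a tag.
        string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);

        // Then turn entities such as &amp; and &lt; into their characters.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

With this sketch, ToPlainText("<p>Tom &amp;amp; Jerry &amp;lt;3</p>") returns `Tom & Jerry <3`. Note that &amp;nbsp; decodes to a non-breaking space (U+00A0), not a regular space, so you may want to normalize whitespace afterwards.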

Happy coding!
