After spending quite some time looking for a simple HTML sanitizer that is (sigh...) not dependent on Node.js, I had no luck. The ones I found either converted < and > to their respective entities (leaving the input even dirtier) or removed the whole tag along with its content (which is not what I'm looking for). So I decided to simply remove unwanted tags, allowing only a very specific subset, and found this answer:
const VALID_TAGS = [
'b', 'strong', 'i', 'em', 's', 'a', 'img', 'blockquote', 'ul', 'li'
];
$("#text *").not( VALID_TAGS.join( ',' ) ).each(function() {
var content = $(this).contents();
$(this).replaceWith(content);
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div id="text">
<blockquote>
<p>Lorem ipsum dolor...</p>
<footer>
By <a href="https://www.github.com/someuser"><strong>Some User</strong></a>
in March, 13
</footer>
</blockquote>
</div>
It works flawlessly for my needs; however, it relies on jQuery to do the job.
How could I accomplish the same without it AND starting from a string, rather than a selector or a ready-to-use Element?
This is merely for display. I am NOT sanitizing (or, better said, filtering) the HTML to be used "as is" on the server side ;)
Use querySelectorAll, a loop, and replaceWith:
const VALID_TAGS = [
'b', 'strong', 'i', 'em', 's', 'a', 'img', 'blockquote', 'ul', 'li'
];
document.querySelectorAll("#text *").forEach(elem => {
if (!VALID_TAGS.includes(elem.tagName.toLowerCase())) {
elem.replaceWith(...elem.childNodes);
}
});
<div id="text">
<blockquote>
<p>Lorem ipsum dolor...</p>
<footer>
By <a href="https://www.github.com/someuser"><strong>Some User</strong></a> in March, 13
</footer>
</blockquote>
</div>
If the input is a string, you can use DOMParser to parse it into an HTML document first:
const parser = new DOMParser();
const htmlDoc = parser.parseFromString(txt, 'text/html');
If this has to run on the server, where there is no built-in DOM, you need to use a library.
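For the string case, the loop and the DOMParser call above can be combined into one function. The following is only a minimal sketch: `stripTagsFromString` is a hypothetical helper name, and it assumes a browser-like environment where `DOMParser` is available.

```javascript
// Tags that are kept; every other element is unwrapped (its children stay).
const VALID_TAGS = ['b', 'strong', 'i', 'em', 's', 'a', 'img', 'blockquote', 'ul', 'li'];

// stripTagsFromString is a hypothetical helper name, not part of any library.
// It assumes an environment that provides DOMParser (for example, a browser).
function stripTagsFromString(html, allowed = VALID_TAGS) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  // querySelectorAll returns a static list, so replacing nodes while iterating is safe.
  doc.body.querySelectorAll('*').forEach(elem => {
    if (!allowed.includes(elem.tagName.toLowerCase())) {
      // Swap the element for its own child nodes, keeping the content.
      elem.replaceWith(...elem.childNodes);
    }
  });
  // Serialize the cleaned-up body back into a string.
  return doc.body.innerHTML;
}
```

Calling `stripTagsFromString('<div><p>Hi <b>there</b></p></div>')` in a browser should give back `Hi <b>there</b>`. Only the tags themselves are filtered here; attributes on the kept tags are left as they are.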
Well, it might not be the best way of doing it, but after a lot of thinking and researching I managed to get the job done with a simple regex, at least until someone shows up with a complete example of how to do it better:
const text = `<div id="text">
<blockquote>
<p>Lorem ipsum dolor...</p>
<footer>
By <a href="https://www.github.com/someuser"><strong>Some User</strong></a>
in March, 13
</footer>
</blockquote>
</div>`;
const VALID_TAGS = [
'b', 'strong', 'i', 'em', 's', 'a', 'img', 'blockquote', 'ul', 'li'
];
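// Removes every tag whose name is not in `allowed`, together with the whitespace around it.
// The negative lookahead skips opening and closing tags whose name is in the allowed list.
function stripTags( input, allowed ) {
  return input.replace( new RegExp( `\\s*<(?!/?\\b(${allowed.join('|')})\\b)[^>]+>\\s*`, 'gi' ), '');
}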
// Collapse line breaks and runs of whitespace into single spaces so that words on adjacent lines are not glued together.
console.log( stripTags( text, VALID_TAGS ).trim().replace( /\s+/g, ' ' ) );
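The lookahead in that pattern can be hard to read, so here is the same idea sketched with a replacer callback instead: match every tag, capture its name, and decide in plain JavaScript whether to keep it. This is an illustrative variant, not the code above. `stripTagsWithCallback` is a hypothetical name, and like the original it will be confused by a `>` inside an attribute value.

```javascript
const VALID_TAGS = ['b', 'strong', 'i', 'em', 's', 'a', 'img', 'blockquote', 'ul', 'li'];

// stripTagsWithCallback is a hypothetical name used only for this sketch.
function stripTagsWithCallback(input, allowed = VALID_TAGS) {
  const allowedSet = new Set(allowed.map(tag => tag.toLowerCase()));
  // Matches an opening or closing tag and captures its name, e.g. "p" in "</p>".
  const TAG = /<\/?([a-z][a-z0-9-]*)\b[^>]*>/gi;
  // Keep the tag text when its name is allowed, otherwise drop the tag only.
  return input.replace(TAG, (match, name) =>
    allowedSet.has(name.toLowerCase()) ? match : ''
  );
}

console.log(stripTagsWithCallback('<p>Hello <b>world</b> and <span>friends</span></p>'));
// prints: Hello <b>world</b> and friends
```

Unlike the lookahead version, this one leaves the whitespace around removed tags alone, so "and friends" keeps its space even though the span is dropped.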