Entry
How do I remove HTML comments?
How do I match HTML comments?
Apr 7th, 2008 22:59
ha mo, Mark Szlazak, anita wigginton,
An HTML comment declaration consists of <! followed by zero or more
comments followed by >. Each comment starts with -- and includes
all text up to and including the next occurrence of --. In a comment
declaration, white space is allowed after each comment, but not before
the first comment.
This means that the following are all legal HTML comments:
1a. <!-- Hello -->
1b. <!--
Hello!
The tag-pair <B>...</B> bolds any text inside.
-->
2a. <!-- Hello -- -- Goodbye -- >
2b. <!-- Hello --
-- Goodbye -- >
3. <!---->
4. <!------ Hello -->
5. <!------> Hello -->
6. <!>
Note that a comment declaration containing only - characters must have a
multiple of four of them to be legal. However, not all HTML parsers
follow this rule, and non-compliant sequences of - such as <!-----> may
be allowed. People often use these sequences as separators in their
source code.
In JavaScript 1.5 the following regular expression will match most
HTML comments:
regX = /<!(?:--.*?--\s*)?>/g;
However, it fails on comments that span multiple lines. The dot
metacharacter in .*? matches any character except a newline, and
JavaScript 1.5 has no modifier that makes the dot match newlines as
well. The expression therefore fails on multiline comments such as 1b.
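The difference is easy to see by running the expression against a one-line and a multi-line comment. This is only a small sketch; the sample strings and variable names are made up for illustration.

```javascript
// The pattern from the text: the dot does not match newline characters.
const dotPattern = /<!(?:--.*?--\s*)?>/g;

// Hypothetical sample inputs, modelled on examples 1a and 1b.
const singleLine = 'a<!-- Hello -->b';
const multiLine = 'a<!--\nHello!\n-->b';

const singleResult = singleLine.replace(dotPattern, '');
const multiResult = multiLine.replace(dotPattern, '');

console.log(singleResult);              // the one-line comment is removed
console.log(multiResult === multiLine); // true: the multi-line comment is untouched
```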
One way to overcome this is to replace . with an alternation such as
(?:.|\n), (?:[^-]|-[^-]) or (?:[^-]|-(?!-)), but the last two won't
catch the illegal <!-------> type of comment mentioned previously.
Furthermore, alternations are inefficient compared to character
classes. Unfortunately the character class [.\n] won't work, because
inside a character class the dot is a literal period rather than a
metacharacter. Instead, use [\s\S], [\d\D] or [\w\W] to match any
character. Also, a trailing \s* is added to the expression so that no
empty lines are left behind when comments are removed.
regX = /<!(?:--[\s\S]*?--\s*)?>\s*/g;
The following function is passed an HTML string and returns a string
with all HTML comments removed.
function removeHTMLComments(html) {
  return html.replace(/<!(?:--[\s\S]*?--\s*)?>\s*/g, '');
}
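As a rough check, the function can be run over some of the legal comments listed earlier. The sketch below repeats the function so that it runs on its own; the array and variable names are illustrative only. It also shows one case the expression does not fully handle: in example 5 the lazy quantifier stops at the first -- that is followed by >, so only the leading part of that declaration is removed.

```javascript
function removeHTMLComments(html) {
  return html.replace(/<!(?:--[\s\S]*?--\s*)?>\s*/g, '');
}

// Examples 1a, 1b, 2b, 3 and 6 from the list above; each should vanish entirely.
const legalComments = [
  '<!-- Hello -->',
  '<!--\nHello!\nThe tag-pair <B>...</B> bolds any text inside.\n-->',
  '<!-- Hello --\n-- Goodbye -- >',
  '<!---->',
  '<!>'
];
const leftovers = legalComments.map(removeHTMLComments);
console.log(leftovers.every(function (s) { return s === ''; })); // true

// The trailing \s* swallows the line break that follows a comment.
const cleaned = removeHTMLComments('<P>one</P>\n<!-- note -->\n<P>two</P>');
console.log(cleaned);

// Example 5: only the leading "<!------> " is matched and removed.
const partial = removeHTMLComments('<!------> Hello -->');
console.log(partial);
```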
A problem with this is that it will also remove code wrapped in <!--
--> inside <script...></script> or <style...></style> elements. Recall
the comment trick used to hide scripts and styles from some old
browsers.
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HEAD>
<TITLE>HTML Comment Example</TITLE>
<!-- Id: html-sgml.sgm,v 1.5 2021/05/26 21:29:50 connolly Exp -->
<STYLE type="text/css"><!--
STYLE BLOCK SHOULD REMAIN!
--></STYLE>
<SCRIPT LANGUAGE="JavaScript">
<!-- hide this stuff from other browsers
SCRIPT BLOCK (1) THAT SHOULD NOT BE REMOVED!
// end the hiding comment -->
</SCRIPT>
</HEAD>
<BODY>
<!-- another --
-- comment -->
<P>Not a <I>comment</I>, just regular old data characters.</P>
<SCRIPT LANGUAGE="JavaScript">
<!-- hide this stuff from other browsers
SCRIPT BLOCK (2) THAT SHOULD NOT BE REMOVED!
// end the hiding comment -->
</SCRIPT>
<!>
</BODY>
</HTML>
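Running the simple version over a script block shows the damage. This is a sketch that uses a shortened, hypothetical input rather than the full page above.

```javascript
// The simple version from earlier in this entry.
function removeHTMLComments(html) {
  return html.replace(/<!(?:--[\s\S]*?--\s*)?>\s*/g, '');
}

// Hypothetical, shortened input using the old script-hiding trick.
const page = [
  '<SCRIPT LANGUAGE="JavaScript">',
  '<!-- hide this stuff from other browsers',
  'SCRIPT BLOCK THAT SHOULD NOT BE REMOVED!',
  '// end the hiding comment -->',
  '</SCRIPT>'
].join('\n');

const stripped = removeHTMLComments(page);

// Only the empty SCRIPT element is left; its contents were treated as a comment.
console.log(stripped);
```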
To avoid this problem the following alternation grouping is used to
also match <SCRIPT...> </SCRIPT> and <STYLE...> </STYLE> sections.
regX = /<(?:!(?:--[\s\S]*?--\s*)?(>)\s*|(?:script|style|SCRIPT|STYLE)[\s\S]*?<\/(?:script|style|SCRIPT|STYLE)>)/g;
Also, a function instead of a string is used as the replacement in the
replace() method. For more on functions as replacements see:
http://www.faqts.com/knowledge_base/view.phtml/aid/15940
The first argument to this function is the string that matched the
pattern. The second argument is the string that matched the capturing
parenthesized subexpression. This parenthesized subexpression is used
as a flag to indicate a match NOT within a SCRIPT or STYLE section.
Comments in the HTML are replaced by an empty string, but SCRIPT and
STYLE sections are replaced by copies of themselves.
Since legal identifier names in JavaScript can have a dollar sign ($)
as the first character, one can name the function's arguments after
the corresponding references used in replacement strings.
function removeHTMLComments(html) {
  return html.replace(regX, function (m, $1) {
    return $1 ? '' : m;
  });
}
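A short usage sketch follows. Here regX is the expression given earlier, kept on one line because a regular expression literal cannot contain line breaks. The input is a hypothetical, shortened page and the variable names are illustrative only. Note that the alternation lists only the all-lowercase and all-uppercase tag names, so mixed-case tags such as <Script> would not be protected.

```javascript
const regX = /<(?:!(?:--[\s\S]*?--\s*)?(>)\s*|(?:script|style|SCRIPT|STYLE)[\s\S]*?<\/(?:script|style|SCRIPT|STYLE)>)/g;

function removeHTMLComments(html) {
  // $1 is ">" only when the comment branch of the alternation matched.
  return html.replace(regX, function (m, $1) {
    return $1 ? '' : m;
  });
}

// Hypothetical, shortened input.
const scriptBlock = [
  '<SCRIPT LANGUAGE="JavaScript">',
  '<!-- hide this stuff from other browsers',
  'SCRIPT BLOCK THAT SHOULD NOT BE REMOVED!',
  '// end the hiding comment -->',
  '</SCRIPT>'
].join('\n');
const page = '<!-- another --\n-- comment -->\n' + scriptBlock + '\n<!>\n<P>text</P>';

const result = removeHTMLComments(page);

// The two comment declarations are gone; the SCRIPT block is copied through unchanged.
console.log(result === scriptBlock + '\n<P>text</P>'); // true
```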