Thursday, September 23, 2010

OWASP has deleted How_to_perform_HTML_entity_encoding_in_Java

http://www.owasp.org/index.php?title=How_to_perform_HTML_entity_encoding_in_Java

I fixed this "naive article" back in spring 2009,
and it contained my proposal for "HTML encoding".

A week ago I discovered a mistake in my code:
2 chars which I should have excluded from the output
were not excluded and were output in encoded form.

I wanted to update the algorithm on the web and, surprise, the page now says:
HTML Entity Encoding is not enough to stop XSS in web applications. Please see
XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet for more information.

So let's see what OWASP's update is.
The article is named XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet.

Why Can't I Just HTML Entity Encode Untrusted Data?
HTML entity encoding is okay for untrusted data that you put in the body of the HTML document, such as inside a div tag. It
even sort of works for untrusted data that goes into attributes, particularly if you're religious about using quotes around
your attributes. But HTML entity encoding doesn't work if you're putting untrusted data inside a script tag anywhere, or an
event handler attribute like onmouseover, or inside CSS, or in a URL. So even if you use an HTML entity encoding method
everywhere, you are still most likely vulnerable to XSS. You MUST use the escape syntax for the part of the HTML document
you're putting untrusted data into. That's what the rules below are all about.
OK, it covers more in one place, excellent....
It introduces "terms" like "HTML Escape" or "Attribute Escape"....
and, no surprise, it strongly promotes
ESAPI and the ESAPI reference implementation.

BEWARE

Check code here:
http://code.google.com/p/owasp-esapi-java/source/browse/trunk/src/main/java/org/owasp/esapi/codecs/HTMLEntityCodec.java

and
the latest version of my "pseudo-code",
still kept inside OWASP's wiki history.

compare and decide .....

Mine works for the "Supplementary Multilingual Plane",
uses only
numeric character references, not character entity references,

and it's immune to client charset switching..
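
To make the comparison concrete, here is a minimal Java sketch of the approach described above: iterate by code point so supplementary-plane characters survive, emit only hex numeric character references, and keep the output pure 7-bit ASCII. This is not the pseudo-code from the wiki history. The class name NcrEncoder, the choice to leave only A-Z, a-z and 0-9 unencoded, and the U+FFFD substitution for lone surrogates are assumptions made for illustration.

```java
public final class NcrEncoder {
    private NcrEncoder() {}

    /**
     * Encodes every code point outside [A-Za-z0-9] as a hex numeric
     * character reference. The result contains only 7-bit ASCII.
     */
    public static String encode(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isAsciiAlnum(cp)) {
                out.append((char) cp);
            } else if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                // Assumption: a lone surrogate is not a valid character,
                // so reference U+FFFD instead of emitting it.
                out.append("&#xfffd;");
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        }
        return out.toString();
    }

    private static boolean isAsciiAlnum(int cp) {
        return (cp >= 'a' && cp <= 'z')
            || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9');
    }
}
```

Because nothing above 0x7F is ever written, the bytes look the same in any ASCII-compatible charset; whether that is enough for every charset a client could switch to is exactly what the comments below argue about.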

We will probably hear more about ESAPI, since they "amuse and scare me" more and more every day....
-------------
BUG FIX: The two extra chars to remove are 0b and 0c (switch the ifs or add an extra else-if line). Sorry....
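
A rough Java sketch of what such an exclusion check could look like. The exact list in the original pseudo-code is not reproduced here, so the set below (all C0 control characters except TAB, LF and CR, which therefore covers 0x0B and 0x0C) is an assumption, and ControlCharFilter is a hypothetical name. The replace flag is there only to show the alternative of substituting U+FFFD instead of dropping the character.

```java
public final class ControlCharFilter {
    private ControlCharFilter() {}

    /**
     * Assumed exclusion set: C0 controls other than TAB (0x09),
     * LF (0x0A) and CR (0x0D). This includes 0x0B and 0x0C.
     */
    static boolean isExcluded(int cp) {
        return cp < 0x20 && cp != 0x09 && cp != 0x0A && cp != 0x0D;
    }

    /**
     * Drops excluded characters, or substitutes U+FFFD for each one
     * when replace is true.
     */
    public static String filter(String input, boolean replace) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (!isExcluded(cp)) {
                out.appendCodePoint(cp);
            } else if (replace) {
                out.append('\uFFFD');
            }
        });
        return out.toString();
    }
}
```

With a single predicate like this, the order of the if branches in the encoder no longer matters, which is the kind of mistake the bug fix above is about.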

3 comments:

  1. There are a number of problems with your suggested code...
    1) Encoding characters > 255 isn't useful, barring games with the character set.
    2) There is no security problem with rendering named entities, although ESAPI uses hex entities to help performance.
    3) Nobody is immune to charset switching
    4) It's dangerous to remove characters entirely; you should replace them with U+FFFD

    Feel free to post comments, but why not help us make ESAPI better? You can post bugs at the GoogleCode repository and help us make ESAPI better.

  2. Ooooh, thanks for the audience!
    1) barring ?
    2) True, but security is not a holy grail; could you please explain the benefits of entities over numbers?
    3) ??? Teach me, please. IMHO, if everything except a-z (we are still talking about HTML markup here) is encoded, how is this NOT immune?
    4) Says who? Link, please.

  3. Funny that OWASP did not react to the article's code in any way for a year and a half, and I did not get any feedback earlier. I have also reviewed the libraries linked in the references (also OWASP-labeled) and published bug reports.... I have tried to "make things better" with no interest from others. I'm not entering that river again....

    Anyway... Thanks for teaching me now....
    I can agree on 2 more points: 2 and 4.
    I still need more explanation on 1 and 3.

    1. 7-bit media? Or am I just too old? User agents switching charset without warning?
    2. Sorry, you are right, I read the answer too quickly; yes, hex could be a bit shorter even under 128 ;-)) None of us encodes with "named entities" anymore.... even if SGML fanatics may not agree ;-)
    3. What charset switch would harm 7-bit output and entities? I still don't get it clearly.
    4. Said by the HTML specs: Adopt a clearly visible, but unobtrusive mechanism to alert the user of missing resources. If missing characters are presented using their numeric representation, use the hexadecimal (not decimal) form since this is the (OK, I will accept FFFD, it looks almost clear and good in all browsers; however, FF numeric rendering is even nicer ;-))))


    Thanks, I will fix the code and make it faster and configurable....
