My findings so far:
First, there are the rules for writing a valid HTML attribute value. Here, though, the standard only requires (provided the attribute value is enclosed in quotes) arbitrary CDATA (actually a %URI, but HTML itself imposes no additional validation at its level: any CDATA will validate).
Some examples:
<a href="javascript:alert('Hi!')"> (1)
<a href="javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')"> (2)
<a href="javascript:if(a&gt;b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')"> (3)
Example (1) is valid. But example (2) is also valid HTML 4.01 Strict. To make it valid XHTML, we only need to escape the XML special characters < > & (example (3) is valid XHTML 1.0 Strict).
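The step from example (2) to example (3) can be sketched as a small escaping helper. This is only an illustration: `escapeXmlAttr` is a hypothetical name, not a standard function, and it assumes the attribute value will be wrapped in double quotes.

```javascript
// Hypothetical helper: escape the XML special characters so that a
// script can sit inside a double-quoted XHTML attribute value.
function escapeXmlAttr(value) {
  return value
    .replace(/&/g, "&amp;") // must run first, otherwise the entities below get re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

const script = "javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')";
console.log('<a href="' + escapeXmlAttr(script) + '">');
// <a href="javascript:if(a &gt; b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')">
```

Note that this only addresses the markup level; it says nothing about whether the resulting value is a valid URI.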
Now, is example (2) a valid javascript: URI? I'm not sure, but I would say no.
From RFC 2396: URIs are subject to some additional restrictions, in particular escaping/unescaping through %xx sequences. And some characters are always forbidden, among them spaces and {}# .
The RFC also defines a subset of opaque URIs: those that have no hierarchical components, and for which the separators carry no special meaning (for example, they have no query part, so ? can be used like any non-special character). I assume that javascript: URIs should be considered among them.
This would mean that the valid characters inside the "body" of the javascript: URI are
a-zA-Z0-9
-_.!~*'();/?:@&=+$,
%hh (an escape sequence, with two hexadecimal digits)
with the additional restriction that it cannot start with / . Some "important" ASCII characters are therefore excluded, such as
{}#[]<>^\
as well as % (because it is used for escape sequences), the double quote " and (most importantly) all whitespace.
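To make the rule above concrete, here is a minimal sketch of a validator built from the RFC 2396 grammar for an opaque part (`uric_no_slash` followed by any number of `uric`). The function name `isValidJavascriptUriBody` is hypothetical, and treating javascript: URIs as opaque is the assumption stated above, not something the RFC says about this scheme.

```javascript
// Character classes from RFC 2396.
const UNRESERVED = "A-Za-z0-9\\-_.!~*'()";
const ESCAPED = "%[0-9A-Fa-f]{2}";
// uric_no_slash: unreserved | escaped | ; ? : @ & = + $ ,
const FIRST = "(?:[" + UNRESERVED + ";?:@&=+$,]|" + ESCAPED + ")";
// uric: reserved | unreserved | escaped (the reserved set adds "/")
const REST = "(?:[" + UNRESERVED + ";/?:@&=+$,]|" + ESCAPED + ")";
const OPAQUE_PART = new RegExp("^" + FIRST + REST + "*$");

// Hypothetical helper: checks the part after "javascript:".
function isValidJavascriptUriBody(body) {
  return OPAQUE_PART.test(body);
}

console.log(isValidJavascriptUriBody("alert('Hi!')"));       // true
console.log(isValidJavascriptUriBody("if(a > b) alert(1)")); // false: spaces and ">"
console.log(isValidJavascriptUriBody("alert(%22hi%22)"));    // true: quotes are escaped
console.log(isValidJavascriptUriBody("/alert(1)"));          // false: starts with "/"
```

A bare % that is not followed by two hexadecimal digits is rejected as well, as is "|", which the RFC lists among its "unwise" characters.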
In some respects, this seems quite permissive: it's important to note that + is valid (and therefore should not be "unescaped" to a space when decoding).
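The point about + can be illustrated with two built-in decoders. This is a small sketch, and the sample script is made up for the example: `decodeURIComponent` only resolves %hh sequences, while form-style decoding (shown here through `URLSearchParams`) also turns + into a space.

```javascript
// A made-up javascript: URI body containing a literal "+".
const body = "alert(1+2)%3B%20alert('done')";

// URI-style decoding: only %hh sequences are resolved, "+" is untouched.
console.log(decodeURIComponent(body));
// alert(1+2); alert('done')

// Form-style decoding also maps "+" to a space, which corrupts the script.
const formDecoded = new URLSearchParams("code=" + body).get("code");
console.log(formDecoded);
// alert(1 2); alert('done')
```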
But in other respects, this seems too restrictive. Braces and brackets, especially: as I understand it, they are usually used unescaped, and browsers have no problem with them.
What about spaces? Like braces, they are forbidden by the RFC, but I see no problem with them in this kind of URI. However, I see that in most bookmarklets they are escaped as "%20". Is there any (empirical or theoretical) explanation for this?
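For reference, the kind of escaping those bookmarklets apply can be sketched as follows. This is a hedged illustration rather than a standard routine: `toJavascriptUri` is a hypothetical name, the allowed set is the RFC 2396 one listed above, and encoding non-ASCII characters as UTF-8 bytes is my own assumption (RFC 2396 does not mandate a character encoding).

```javascript
// Characters RFC 2396 allows unescaped in an opaque part (reserved +
// unreserved). "%" is deliberately absent: a literal percent sign in the
// script must itself be written as %25.
const ALLOWED = /^[A-Za-z0-9\-_.!~*'();\/?:@&=+$,]$/;

// Hypothetical helper, not a standard function.
function toJavascriptUri(script) {
  let body = "";
  for (const ch of script) { // iterates by code point
    if (ALLOWED.test(ch)) {
      body += ch;
    } else {
      // Assumption: escape each UTF-8 byte of the character as %HH.
      for (const byte of new TextEncoder().encode(ch)) {
        body += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
      }
    }
  }
  // An opaque part may not begin with "/".
  if (body.startsWith("/")) body = "%2F" + body.slice(1);
  return "javascript:" + body;
}

console.log(toJavascriptUri('if (a > b) { alert("hi") }'));
// javascript:if%20(a%20%3E%20b)%20%7B%20alert(%22hi%22)%20%7D
```

Since every escape is a plain %HH sequence, `decodeURIComponent` applied to the body recovers the original script, including any + signs.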
I still don't know whether there are standard functions for this escaping/unescaping (in mainstream languages), or at least some sample code.