URL encoding rules with the `javascript:` pseudo-protocol

Is there an authoritative reference for the syntax and URL encoding of the javascript: pseudo-protocol? (I know it is not well regarded, but it is useful for bookmarklets in any case.)

First, we know that standard URLs follow the syntax:

 scheme://username:password@domain:port/path?query_string#anchor

but this format does not seem to apply here. It would probably be more correct to speak of a URI rather than a URL; the "unofficial" format here is javascript:{body}.

So, what are the valid characters for such a URI (what are the escape / unescape rules) when embedding in HTML?

In particular, if I have the code of a JavaScript function and I want to embed it in a javascript: URI, which escaping rules should I apply?

Of course, every non-alphanumeric character could be escaped, but that would be overkill and would make the code unreadable. I want to escape only the characters that need it.

Further, it is clear that it would be wrong to use a urlencode / urldecode pair (those are meant for query string values): we do not want, for example, to decode "+" to a space.
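To make the "+" concern concrete, here is a small sketch of the two decoding conventions in JavaScript. It assumes a runtime that provides URLSearchParams (modern browsers, Node.js); the sample strings are made up for illustration.

```javascript
// Plain percent-decoding: only %hh sequences are decoded, "+" is untouched.
const percentDecoded = decodeURIComponent("alert(1+2)%3B");

// Query-string (application/x-www-form-urlencoded) decoding:
// %hh sequences are decoded AND "+" becomes a space.
const formDecoded = new URLSearchParams("code=alert(1+2)%3B").get("code");

console.log(percentDecoded); // alert(1+2);
console.log(formDecoded);    // alert(1 2);
```

The second result is the behaviour to avoid for script bodies: the addition silently turns into a syntax error.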

1 answer

My findings so far:

First, there are the rules for writing a valid HTML attribute value. Here the standard only requires (if the attribute value is enclosed in quotation marks) arbitrary CDATA. The declared type is actually %URI, but HTML itself does not impose any additional validation at its level: any CDATA will validate.

Some examples:

  <a href="javascript:alert('Hi!')">   (1)
  <a href="javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')">   (2)
  <a href="javascript:if(a&gt;b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')">   (3)

Example (1) is valid. But example (2) is also valid HTML 4.01 Strict. To make it valid XHTML, we only need to escape the XML special characters < > & (example 3 is valid XHTML 1.0 Strict).

Now, is example (2) a valid javascript: URI? I'm not sure, but I would say no.
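The HTML-level step can be sketched as a small helper. This is only an illustration of the entity escaping described above; escapeAttr is a hypothetical name, and it assumes the attribute value is enclosed in double quotes.

```javascript
// Hypothetical helper: escape a string for use inside a double-quoted
// HTML/XHTML attribute value. "&" is replaced first so that the entities
// produced by the later replacements are not escaped a second time.
function escapeAttr(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

const href = "javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')";
console.log('<a href="' + escapeAttr(href) + '">');
// <a href="javascript:if(a &gt; b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')">
```

This only addresses markup validity; whether the resulting value is also a valid URI is a separate question.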

From RFC 2396: URIs are subject to some additional restrictions, in particular escaping / unescaping through %xx sequences. And some characters are always forbidden: among them are spaces and {}#.

The RFC also defines a subset of opaque URIs: those that have no hierarchical components and in which separators have no special meaning (for example, they have no query string, so ? can be used like any ordinary character). I assume that javascript: URIs should be considered among them.

This would mean that the valid characters inside the "body" of the javascript: URI are

  a-zA-Z0-9
  _.!~*'();?:@&=+$,/-
  %hh   (escape sequence, with two hexadecimal digits)

with the additional restriction that it cannot start with /. This leaves out some "important" ASCII characters, such as

 {}#[]<>^\

as well as % (because it is used for escape sequences), the double quotation mark " and (most importantly) all whitespace.
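To see which characters of a given script body fall outside that set, a rough checker can be written as follows. This is a sketch, not a validator: charsNeedingEscape is a hypothetical name, and the character class is taken directly from RFC 2396 (alphanumerics, the marks -_.!~*'() and the reserved characters ;/?:@&=+$,). Under that RFC, | belongs to the "unwise" group, so the sketch reports it as well.

```javascript
// Characters that may appear unescaped according to RFC 2396:
// alphanumerics, marks -_.!~*'() and reserved ;/?:@&=+$,
const ALLOWED = /^[A-Za-z0-9\-_.!~*'();\/?:@&=+$,]$/;

// Hypothetical helper: return the distinct characters of a script body
// that would need a %hh escape, in order of first appearance.
function charsNeedingEscape(body) {
  const found = new Set();
  for (const ch of body) {
    if (!ALLOWED.test(ch)) found.add(ch);
  }
  return [...found];
}

console.log(charsNeedingEscape("if(a[0]<1){ x=50%2 }"));
// [ '[', ']', '<', '{', ' ', '%', '}' ]
```

Note that a literal % is reported too: it is only legal as the start of a %hh sequence, so a percent sign in the script itself has to be written as %25.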

In some respects, this seems quite permissive: it's important to note that + is valid (and therefore must not be "unescaped" into a space when decoding).

But in other respects, this seems too restrictive. Braces and brackets, especially: as far as I understand, they are normally used unescaped, and browsers have no problems with them.

What about spaces? Like braces, they are forbidden by the RFC, but I do not see any problem with them in this kind of URI. However, I see that in most bookmarklets they are escaped as "%20". Is there any (empirical or theoretical) explanation for this?

I still don't know whether there are standard functions for this kind of escaping / unescaping (in mainstream languages), or some sample code.
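One candidate in JavaScript itself is the built-in encodeURI: it leaves alphanumerics and the characters ;/?:@&=+$,-_.!~*'()# untouched and percent-encodes everything else, which matches the RFC 2396 set except for #. A minimal sketch under that assumption follows; toJavascriptUri and fromJavascriptUri are hypothetical names, not standard functions.

```javascript
// Hypothetical helper: build a javascript: URI from a script body.
// encodeURI escapes spaces, braces, brackets, double quotes, "%" and so on,
// but keeps "+" and "#"; "#" is escaped separately because it would
// otherwise be read as the start of a fragment.
function toJavascriptUri(body) {
  return "javascript:" + encodeURI(body).replace(/#/g, "%23");
}

// Reverse step: strip the scheme and decode %hh sequences only,
// so "+" is never turned into a space.
function fromJavascriptUri(uri) {
  return decodeURIComponent(uri.replace(/^javascript:/i, ""));
}

const body = 'if (a + b > 1) { alert("50% #1"); }';
const uri = toJavascriptUri(body);
console.log(uri);
console.log(fromJavascriptUri(uri) === body); // true
```

The result still contains & unescaped where the script uses it, so when the URI goes into an XHTML attribute the entity escaping for & is needed on top of this.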


Source: https://habr.com/ru/post/1315843/

