Allowed Characters
RFC 3986 defines which characters are allowed in each URI component.
RFCs for specific URI schemes can further limit this.
If you are interested in HTTP/HTTPS URIs: they are defined in RFC 7230. AFAIK, it places no additional restrictions on valid characters, so you can stick to the definitions in RFC 3986.
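As a rough illustration of the character sets from RFC 3986, section 2, here is a minimal Python sketch. The function name `disallowed_chars` is made up for this example, and the check is deliberately coarse: it only flags characters that are not allowed anywhere in a URI and does not validate which character is allowed in which component.

```python
import string

# Character classes from RFC 3986, section 2.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
# "%" is only valid as the start of a percent-encoded triplet such as "%20";
# this sketch does not check that two hex digits actually follow it.
URI_CHARS = UNRESERVED | GEN_DELIMS | SUB_DELIMS | {"%"}


def disallowed_chars(uri):
    """Return the characters in `uri` that appear in none of the RFC 3986 sets."""
    return [ch for ch in uri if ch not in URI_CHARS]


print(disallowed_chars("http://example.com/a&b"))  # no offending characters
print(disallowed_chars("http://example.com/a b"))  # the space is flagged
```

An empty result does not prove the URI is valid; it only means no character is outright forbidden.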
What happens if invalid characters are used?
It depends on many factors, so anything could happen: from "nothing happens" to "it no longer works."
Does the URL itself recognize illegal characters and encode them into something else?
A URI cannot fix anything by itself; it is just a string.
Clients working with this URI (browser, server, email client, etc.) may try to fix the URI (or work with invalid URIs) in accordance with their own rules.
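To give an idea of what such a client-side repair might look like, here is a sketch using Python's standard `urllib.parse` module. The helper name `repair_path` and the choice of which characters to leave untouched are assumptions made for this example; real browsers and servers apply their own, often different, rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left as-is in the path: "/" as the separator, "%" so existing
# escapes are not double-encoded, plus ":", "@" and the RFC 3986 sub-delims.
PATH_SAFE = "/%:@!$&'()*+,;="


def repair_path(uri):
    """Percent-encode characters in the path that a URI may not contain."""
    parts = urlsplit(uri)
    fixed_path = quote(parts.path, safe=PATH_SAFE)
    return urlunsplit(parts._replace(path=fixed_path))


print(repair_path("http://example.com/a b"))  # http://example.com/a%20b
print(repair_path("http://example.com/a&b"))  # http://example.com/a&b
```

The input string itself is never changed; the client simply produces a different, valid URI from it.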
URI and link
Also note that there is a difference between a URI and linking to (or storing, etc.) that URI in a document.
The host language (e.g. HTML) may have its own encoding rules. These do not change the URI, only how the URI is written in that document.
For example, the valid URI http://example.com/a&b should be linked like this in HTML documents:
<a href="http://example.com/a&amp;b">Link</a>
The URI is still http://example.com/a&b, not http://example.com/a&amp;b.
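The same point can be checked with Python's standard `html` module. This is only a sketch; the variable names are made up for the example. Escaping changes how the URI is written inside the HTML attribute, and unescaping the attribute value gives back the original URI.

```python
from html import escape, unescape

uri = "http://example.com/a&b"

# Escape the URI for use inside an HTML attribute value.
href_value = escape(uri, quote=True)
link = f'<a href="{href_value}">Link</a>'
print(link)  # <a href="http://example.com/a&amp;b">Link</a>

# An HTML parser undoes the escaping, so the URI itself is unchanged.
recovered = unescape(href_value)
print(recovered == uri)  # True
```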