Encoding / Decoding Standards for Language Agnostic Cookies

Question

I am having trouble working out what the standard is (or whether there is one) for encoding and decoding cookie values, regardless of the backend platform.

According to RFC 2109:

The VALUE is opaque to the user agent and may be anything the origin server chooses to send, possibly in a server-selected printable ASCII encoding. "Opaque" implies that the content is of interest and relevance only to the origin server. The content may, in fact, be readable by anyone that examines the Set-Cookie header.

which sounds like "the server is the boss" and it decides which encoding is applied. This makes it difficult to set a cookie from, say, a PHP backend and read it from Python, Java, or anything else without writing manual encoding/decoding on both sides.

Say we need to encode the value /"печенье (*} значения"/ , which is Russian for "cookie value" with some additional non-alphanumeric characters.

Python:

Almost every WSGI server does the same thing and uses Python's SimpleCookie class, which encodes to and decodes from octal escape sequences, even though many say octal literals are deprecated in ECMA-262 strict mode. Wtf?

So, our original cookie value becomes "/\"\320\277\320\265\321\207\320\265\320\275\321\214\320\265 (*} \320\267\320\275\320\260\321\207\320\265\320\275\320\270\321\217\"/"
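For anyone who has to read such a cookie from another stack, the escaping above can be reproduced by hand. The following is a minimal sketch, assuming the value is encoded to UTF-8 bytes first (as the output above suggests). `octal_quote` and `octal_unquote` are hypothetical helper names, and the set of characters left unescaped is an assumption chosen to match the output above, not a copy of the library's code.

```python
import re
import string

# Assumption: characters that pass through unescaped inside the quoted value.
_UNESCAPED = (string.ascii_letters + string.digits
              + "!#$%&'*+-.^_`|~:" + " ()/<=>?@[]{}")

def octal_quote(value: str) -> str:
    """Escape a string as a double-quoted value with \\ooo octal escapes."""
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if ch in '"\\':
            out.append("\\" + ch)          # backslash-escape quote and backslash
        elif ch in _UNESCAPED:
            out.append(ch)                 # safe ASCII passes through
        else:
            out.append("\\%03o" % byte)    # everything else becomes \ooo
    return '"' + "".join(out) + '"'

# A backslash followed by three octal digits, or by any single byte.
_ESCAPE = re.compile(rb"\\([0-7]{3}|.)", re.DOTALL)

def octal_unquote(quoted: str) -> str:
    """Reverse octal_quote: strip the quotes and decode the escapes."""
    body = quoted[1:-1].encode("ascii")

    def replace(match):
        token = match.group(1)
        return bytes([int(token, 8)]) if len(token) == 3 else token

    return _ESCAPE.sub(replace, body).decode("utf-8")
```

The decoder is the part another language would have to reimplement, which is exactly the manual work on both sides that I would like to avoid.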

Node.js:

Not tested at all, but I assume a JavaScript backend would do this with the native encodeURIComponent and decodeURIComponent, which use hexadecimal escaping/unescaping?

PHP:

PHP applies urlencode to cookie values, which is similar to encodeURIComponent but not exactly the same.

So the original value becomes %2F%22%D0%BF%D0%B5%D1%87%D0%B5%D0%BD%D1%8C%D0%B5+%28%2A%7D+%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D1%8F%22%2F , which is not even wrapped in double quotes.

However, if a JavaScript variable named value holds the PHP-encoded value above, decodeURIComponent(value) gives /"печенье+(*}+значения"/ . Note the "+" characters where the spaces should be.
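The same mismatch can be reproduced in Python with the standard library. This is only a sketch: it assumes `quote_plus` is a fair stand-in for PHP's urlencode on this input, and `unquote` for decodeURIComponent; the variable names are mine.

```python
from urllib.parse import quote_plus, unquote, unquote_plus

value = '/"печенье (*} значения"/'

# Form-style encoding: a space becomes '+', everything unsafe becomes %XX.
php_style = quote_plus(value)

# Plain percent-decoding leaves '+' untouched, like decodeURIComponent.
naive = unquote(php_style)

# Form-style decoding turns '+' back into a space.
correct = unquote_plus(php_style)

print(php_style)
print(naive)
print(correct)
```

So a reader has to know whether the writer used form-style encoding (space as "+") or plain percent-encoding (space as "%20") before it can pick the right decoder.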

What is the situation in Java, Ruby, Perl and .NET? Which language matches (or comes closest to) the desired behavior? In fact, is there any standard for this defined by the W3C?

language-agnostic encoding cookies web-standards decoding

kirpit Feb 24 '13 at 19:36

3 answers

Hazzit · Answer 1 · 2013-03-05T20:11:27+0000

I think you have something mixed up here. The server's encoding does not matter to the client, and it should not. This is what RFC 2109 is trying to say.

The concept of cookies in HTTP is similar to this real-life situation: after paying the entrance fee to a club, you get an ink stamp on your wrist. This allows you to leave and re-enter the club without paying again. All you have to do is show your wrist to the bouncer. In this example, you don't care what the stamp looks like, and it may even be invisible in normal light. All that matters is that the bouncer recognizes it. If you were to wash it off, you would lose the privilege of re-entering the club without paying again.

In HTTP, the same thing happens. The server sets a cookie in the browser. When the browser comes back to the server (read: on the next HTTP request), it shows the cookie to the server. The server recognizes the cookie and acts accordingly. Such a cookie can be as simple as a "WasHereBefore" token. Again, it does not matter whether the browser understands what it is. If you delete your cookie, the server will act as if it had never seen you before, just like the bouncer in that club if you washed off the ink stamp.

Today, many cookies store only one important piece of information: a session ID. Everything else is stored on the server side and is associated with this session identifier. The advantage of this system is that the actual data never leaves the server and, as such, can be trusted. Everything that is stored on the client side can be changed and should not be trusted.
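A minimal sketch of that idea, assuming an in-memory dictionary as the session store (a real server would use a database or cache). The names `create_session` and `load_session` are hypothetical, not part of any framework.

```python
import secrets

# Hypothetical server-side store: session ID -> data that never leaves the server.
_sessions = {}

def create_session(data: dict) -> str:
    """Store data on the server and return the opaque ID to put in the cookie."""
    session_id = secrets.token_urlsafe(32)   # only letters, digits, '-' and '_'
    _sessions[session_id] = dict(data)
    return session_id

def load_session(session_id: str):
    """Look up the data for a cookie value; unknown or forged IDs yield None."""
    return _sessions.get(session_id)
```

Because the ID consists only of URL-safe ASCII characters, no encoding question arises for this kind of cookie at all.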

Edit: After reading your comment and re-reading your question, I think I finally understand your situation and why you are interested in the actual encoding of the cookie, rather than just leaving it to your programming language. If there are two different software environments on the same server (for example, Perl and PHP), you might want to decode a cookie that was set by the other language. In this example, PHP would have to decode the Perl cookie or vice versa.

There is no standard for how data is stored in a cookie. The standard only says that the browser will send the cookie back exactly as it was received. The encoding scheme used depends on your programming language.

Returning to the real-life example, you now have two bouncers, one of whom speaks English while the other speaks Russian. Both will have to agree on one type of ink. Most likely, this means that at least one of them has to learn the other's language.

Since the behavior of the browser is standardized, you can either emulate the encoding scheme of one language in all the other languages used on your server, or simply create your own standardized encoding scheme for all of them. You may need to use lower-level routines such as PHP's header() instead of higher-level ones such as session_start().
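As an illustration of the second option, here is a minimal sketch of one possible shared scheme: strict percent-encoding of the UTF-8 bytes, written into a raw header line. The function names are hypothetical, and the sketch assumes the cookie name is already a plain ASCII token.

```python
from urllib.parse import quote, unquote

def build_set_cookie(name: str, value: str) -> str:
    """Build a raw Set-Cookie header line with a percent-encoded value."""
    # safe="" escapes '/' as well; spaces become %20, never '+'.
    encoded = quote(value, safe="")
    return f"Set-Cookie: {name}={encoded}"

def read_cookie_value(encoded: str) -> str:
    """Decode a value written by build_set_cookie."""
    return unquote(encoded)

header = build_set_cookie("greeting", '/"печенье (*} значения"/')
print(header)
```

This form should be readable with decodeURIComponent in JavaScript and rawurldecode in PHP, which is why it avoids the "+" convention for spaces.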

BTW: In the same way, the server-side programming language decides how session data is stored on the server. You cannot access a Perl CGI::Session through PHP's $_SESSION.

ykaganovich · Answer 2 · 2013-03-06T00:00:45+0000

Regardless of the fact that the cookie is opaque to the client, it must still comply with the HTTP specification. RFC 2616 specifies that HTTP header text is limited to ISO-8859-1, of which ASCII is the safe subset. RFC 5987 extends this to support other character sets in header parameters, but I don't know how widely it is supported.

pestilence669 · Answer 3 · 2013-03-06T07:21:32+0000

I prefer to encode in UTF-8 and wrap the result in base64. It is fast, ubiquitous, and will never mangle your data at either end.

You will need to ensure an explicit conversion to UTF-8 even when wrapping it. Other languages and runtimes, while supporting Unicode, may not store strings as UTF-8 internally ... like many Windows APIs. Python 2.x, in my experience, rarely gets Unicode strings right without explicit conversion.

ENCODE: nativeString -> utfEncode() -> base64Encode()

DECODE: base64Decode() -> utfDecode() -> nativeString
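In Python 3 the two pipelines could look like the sketch below. The function names are hypothetical, and the choice of the standard base64 alphabet is an assumption; its "+", "/" and "=" characters may need the URL-safe variant (`base64.urlsafe_b64encode`) depending on how the other side parses cookies.

```python
import base64

def encode_cookie_value(native: str) -> str:
    """nativeString -> UTF-8 bytes -> base64 text (ASCII only)."""
    utf8_bytes = native.encode("utf-8")             # explicit UTF-8 step
    return base64.b64encode(utf8_bytes).decode("ascii")

def decode_cookie_value(encoded: str) -> str:
    """base64 text -> UTF-8 bytes -> nativeString."""
    utf8_bytes = base64.b64decode(encoded)
    return utf8_bytes.decode("utf-8")               # explicit UTF-8 step

token = encode_cookie_value('/"печенье (*} значения"/')
print(token)
print(decode_cookie_value(token))
```

Keeping the UTF-8 step explicit means every language produces the same bytes before base64 is applied, regardless of how it stores strings internally.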

Almost every language I know of supports this today. You could look for a universal, single-function encoder, but I err on the side of caution and choose the two-step approach ... especially with foreign character sets.
