Canonicalizing URLs to lowercase
I want to write an HTTP module that converts URLs to lowercase. My first attempt ignored international character sets and worked fine:
// Convert URL virtual path to lowercase
string lowercase = context.Request.FilePath.ToLowerInvariant();

// If anything changed then issue 301 Permanent Redirect
if (!lowercase.Equals(context.Request.FilePath, StringComparison.Ordinal))
{
    context.Response.RedirectPermanent(...lowercase URL...);
}
The Turkey Test (international cultures):
But what about cultures other than US English? I consulted the Turkey Test to find a test URL:
http:
This insidious little gem destroys any notion that case conversion in URLs is easy! Its lowercase and uppercase versions, respectively, are as follows:
http://example.com/ııii
http://example.com/IIİİ
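The two results above can be reproduced with a small console sketch. It assumes the test path consists of the same four characters as the file name used later in this post (I, ı, İ, i), and that Turkish culture data is available to the runtime. The class and method names are made up for illustration.

```csharp
using System;
using System.Globalization;

public static class TurkishCaseDemo
{
    // Lowercases a path using the casing rules of the named culture.
    public static string Lower(string path, string cultureName)
    {
        return path.ToLower(CultureInfo.GetCultureInfo(cultureName));
    }

    // Uppercases a path using the casing rules of the named culture.
    public static string Upper(string path, string cultureName)
    {
        return path.ToUpper(CultureInfo.GetCultureInfo(cultureName));
    }

    public static void Main()
    {
        // Assumed test path: I (U+0049), ı (U+0131), İ (U+0130), i (U+0069).
        const string path = "/Iıİi";

        Console.WriteLine(Lower(path, "tr-TR")); // /ııii under Turkish rules
        Console.WriteLine(Upper(path, "tr-TR")); // /IIİİ under Turkish rules

        // Invariant casing is printed for comparison only; it follows
        // different rules for these letters than Turkish casing does.
        Console.WriteLine(path.ToLowerInvariant());
    }
}
```

Under Turkish rules, dotted and dotless I are separate letters, each with its own upper and lower form. That is why neither result matches what an ASCII-only mindset expects.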
For case conversion to work with Turkish URLs, I first had to set the current ASP.NET culture to Turkish:
<system.web>
  <globalization culture="tr-TR" />
</system.web>
Then I had to change my code to use the current culture to convert case:
// Convert URL virtual path to lowercase
string lowercase = context.Request.FilePath.ToLower(CultureInfo.CurrentCulture);

// If anything changed then issue 301 Permanent Redirect
if (!lowercase.Equals(context.Request.FilePath, StringComparison.Ordinal))
{
    context.Response.RedirectPermanent(...);
}
But wait! Will StringComparison.Ordinal still work? Or should I use StringComparison.CurrentCulture? I'm not sure about either!
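One way to probe the question is to isolate the "did anything change?" check and run it against the Turkish test characters. The sketch below is only an experiment, not a verdict on which comparison is right for a production module. It relies on the fact that an ordinal comparison looks at raw UTF-16 code units and does not consult any culture. The helper name NeedsRedirect is hypothetical.

```csharp
using System;
using System.Globalization;

public static class RedirectCheck
{
    // Returns true when lowercasing with the given culture changed
    // at least one UTF-16 code unit of the path.
    public static bool NeedsRedirect(string path, CultureInfo culture)
    {
        string lowercase = path.ToLower(culture);
        return !string.Equals(lowercase, path, StringComparison.Ordinal);
    }

    public static void Main()
    {
        CultureInfo turkish = CultureInfo.GetCultureInfo("tr-TR");

        // Mixed-case path: Turkish lowercasing changes I and İ.
        Console.WriteLine(NeedsRedirect("/Iıİi", turkish)); // True

        // Already lowercase under Turkish rules: nothing changes.
        Console.WriteLine(NeedsRedirect("/ııii", turkish)); // False
    }
}
```

Because the check only asks whether the converted string differs from the input, a culture-independent comparison seems sufficient for it. Which culture should do the conversion is a separate question.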
File Names: It Gets MUCH WORSE!
Even if the above works, using the current culture for case conversion breaks against the NTFS file system! Let's say I have a static file called Iıİi.html:
http:
Although the Windows file system is case-insensitive, it does not use linguistic culture rules. Converting the above URL to lowercase results in a 404 error because the file system does not consider these two names equal:
http:
Correct case conversion for file names? Who knows?!
The MSDN article Recommendations for Using Strings in the .NET Framework has a note (about halfway through the article):
Note: The string behavior of the file system, registry keys and values, and environment variables is best represented by StringComparison.OrdinalIgnoreCase.
Huh? Best represented??? Is that the best we can do in C#? So what is the correct case conversion to match the file system? Who knows?!!? All we can say is that string comparisons using the above will probably work MOST of the time.
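To see what that note means in practice, the sketch below compares file names with StringComparison.OrdinalIgnoreCase. It only approximates file system behavior, as the note itself implies, and it does not touch the disk. The helper name SameFileName is made up for illustration.

```csharp
using System;

public static class FileNameComparison
{
    // Approximates a case-insensitive file name match using the
    // comparison the MSDN note recommends. This is not the file
    // system's own table, only the closest built-in comparison.
    public static bool SameFileName(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Plain ASCII names differing only in case match.
        Console.WriteLine(SameFileName("Page.HTML", "page.html")); // True

        // Original name versus its Turkish-lowercased form. The result is
        // printed rather than assumed, since it depends on the runtime's
        // ordinal casing rules for the dotted and dotless I.
        Console.WriteLine(SameFileName("Iıİi.html", "ııii.html"));
    }
}
```

If the second line prints False on your machine, it agrees with the 404 described above: the culture-lowercased name is not the same file as far as ordinal rules are concerned.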
Summary: Two case conversions: static / dynamic URLs
- So we have seen that static URLs (URLs whose file path matches a real directory or file in the file system) must use an unknown case conversion that is only "best represented by" StringComparison.OrdinalIgnoreCase. And note that there is no string.ToLowerOrdinal() method, so it is very difficult to know exactly which case conversion corresponds to an OrdinalIgnoreCase string comparison. Using string.ToLowerInvariant() is probably the best bet, but it breaks linguistic culture.
- On the other hand, dynamic URLs (URLs whose file path does not match a real file on disk and is instead mapped to the application) can use string.ToLower(CultureInfo.CurrentCulture), but that breaks file system matching, and it is somewhat unclear which edge cases might break this strategy.
Thus, it looks like case conversion first requires detecting whether a URL is static or dynamic before choosing one of the two conversion methods. For static URLs, it is uncertain how to change case without breaking Windows file system matching. For dynamic URLs, it is questionable whether case conversion using culture will similarly break the URL.
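The two-branch idea can be sketched as a pure function that takes the static/dynamic decision as a predicate. Everything here is an assumption made for illustration: the class name, the method name, and especially the predicate. In an ASP.NET module the predicate might be something like File.Exists(context.Server.MapPath(path)), but that is a guess, and the sketch does not resolve the uncertainty described above.

```csharp
using System;
using System.Globalization;

public static class UrlCaseCanonicalizer
{
    // Static paths get invariant lowercasing (the "best bet" for file names);
    // dynamic paths get culture-aware lowercasing. isStaticPath is a
    // caller-supplied, hypothetical test for "maps to a real file".
    public static string ToCanonicalPath(
        string path, Func<string, bool> isStaticPath, CultureInfo culture)
    {
        return isStaticPath(path)
            ? path.ToLowerInvariant()
            : path.ToLower(culture);
    }

    public static void Main()
    {
        CultureInfo turkish = CultureInfo.GetCultureInfo("tr-TR");

        // Stand-in predicate for the demo: treat *.html as static files.
        Func<string, bool> isStatic =
            p => p.EndsWith(".html", StringComparison.OrdinalIgnoreCase);

        Console.WriteLine(ToCanonicalPath("/Docs/INDEX.html", isStatic, turkish)); // /docs/index.html
        Console.WriteLine(ToCanonicalPath("/PRODUCTS/LIST", isStatic, turkish));   // /products/lıst
    }
}
```

Passing the predicate in keeps the casing logic testable without a web context. It also makes the weak point visible: the result is only as good as the static/dynamic detection.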
Phew! Does anyone have a solution to this mess? Or should I just close my eyes and pretend it's all ASCII?