Canonicalize URLs to lowercase without breaking the file system or culture?

Canonicalize URLs to lowercase

I want to write an HTTP module that converts URLs to lowercase. My first attempt ignored international character sets and works fine:

    // Convert URL virtual path to lowercase
    string lowercase = context.Request.FilePath.ToLowerInvariant();

    // If anything changed then issue 301 Permanent Redirect
    if (!lowercase.Equals(context.Request.FilePath, StringComparison.Ordinal))
    {
        context.Response.RedirectPermanent(...lowercase URL...);
    }

The Turkey Test (international cultures):

But what about cultures other than en-US? I referred to the Turkey Test to come up with a test URL:

 http://example.com/Iıİi 

This insidious little gem destroys any notion that case conversion in URLs is simple! Its lowercase and uppercase versions, respectively, are:

    http://example.com/ııii
    http://example.com/IIİİ

For case conversion to work with Turkish URLs, I first had to set the current ASP.NET culture to Turkish:

    <system.web>
      <globalization culture="tr-TR" />
    </system.web>

Then I had to change my code to use the current culture to convert case:

    // Convert URL virtual path to lowercase
    string lowercase = context.Request.FilePath.ToLower(CultureInfo.CurrentCulture);

    // If anything changed then issue 301 Permanent Redirect
    if (!lowercase.Equals(context.Request.FilePath, StringComparison.Ordinal))
    {
        context.Response.RedirectPermanent(...);
    }

But wait! Will StringComparison.Ordinal still work? Or should I use StringComparison.CurrentCulture? I'm really not sure about either!

File Names: It Gets MUCH WORSE!

Even if the above works, using the current culture for case conversion breaks the NTFS file system! Let's say I have a static file called Iıİi.html:

 http://example.com/Iıİi.html 

Although the Windows file system is not case sensitive, it does not use a language culture. Converting the above URL to lowercase results in a 404 error because the file system does not consider these two names equal:

 http://example.com/ııii.html 

Correct case conversion for file names? Who knows?!

The MSDN article Best Practices for Using Strings in the .NET Framework has a note (about halfway through the article):

Note: The string behavior of the file system, registry keys and values, and environment variables is best represented by StringComparison.OrdinalIgnoreCase.

Huh? "Best represented"??? Is that the best we can do in C#? So what is the correct case conversion to match the file system? Who knows?!!? All we can say is that string comparisons using the above will probably work MOST of the time.

Summary: Two case conversions: static / dynamic URLs

  • So we have seen that static URLs (URLs whose file path maps to a real directory/file in the file system) must use an unknown case conversion that is only "best represented" by StringComparison.OrdinalIgnoreCase. And note that there is no string.ToLowerOrdinal() method, so it is very difficult to know exactly which case conversion corresponds to the OrdinalIgnoreCase string comparison. Using string.ToLowerInvariant() is probably the best bet, but it breaks language culture.
  • On the other hand, for dynamic URLs (URLs whose file path does not map to a real file on disk, but to your application) you can use string.ToLower(CultureInfo.CurrentCulture), but it breaks the file system, and it is somewhat unclear what edge cases exist that might break this strategy.

Thus, it looks like case conversion first requires detecting whether a URL is static or dynamic before choosing one of the two conversion methods. For static URLs, it is uncertain how to change case without breaking the Windows file system. For dynamic URLs, it is unclear whether case conversion using culture might likewise break the URL.

Phew! Does anyone have a solution to this mess? Or should I just close my eyes and pretend it's all ASCII?

+7
3 answers

You have incompatible goals.

  • Have culture-sensitive case lowering. If Turkish seems bad, you don't want to know about some of the Georgian scenarios, never mind that ß upper-cases to SS or, less often, SZ. Either way, to have full case folding where lower("ß") matches lower(upper("ß")), you need to treat it as equivalent to at least one of those two-character sequences. As a rule, we aim for case folding rather than case lowering where possible (here it is not possible).

  • Use this in a non-culture-sensitive context. URIs are ultimately opaque strings. That they may be human-readable is useful for coders, users, search engines and marketers, but their ultimate job is to identify a resource by direct case-sensitive comparison.

  • Match this to NTFS, which has case insensitivity based on the mappings in the $UpCase file. It compares the upper-case forms of names in a culture-insensitive way (at least then it does not have to decide whether Σ lower-cases to σ or ς).

  • Presumably do well in terms of SEO and human readability. This may be part of your original goal, but whileThisIsNotVeryEasyToReadOrParse itseasierforbothpeopleandmachinesthanthis. Folding case loses information.

I suggest a different approach.

  • Start with your starting string, whatever it is and wherever it came from (an NTFS file name, a database record, an HttpHandler binding in web.config). Treat that as your canonical form. By all means have rules that people should create these strings according to a canonical form, and enforce them wherever you can, but if something slips through that violates your rules, then accept it as the official canonical name for that resource, whether you like it or not.

  • As far as possible, the canonical name should be the only one "seen" by the outside world. This can be enforced programmatically or simply be a matter of best practice, since canonicalizing after the fact with 301s does not solve the fact that outside entities do not know you do this until they dereference the URI.

  • When a request is received, test it according to how it is going to be used. So, while you may choose a particular culture (or not) for those cases where you do the resource lookup yourself, with so-called "static" URIs your logic can deliberately follow that of NTFS, simply by using NTFS to do the work:

    • Find the mapped file, ignoring the matter of case sensitivity.
    • If there is no match, return 404; who cares about case?
    • If found, do a case-sensitive ordinal comparison; if it does not match, 301 to the case-sensitive mapping.
    • Otherwise, proceed as usual.
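The lookup steps above could be sketched roughly as follows for a single path segment. This is only an illustration under stated assumptions: the names CaseCheck, StaticPathCanonicalizer, FindOnDiskName and Classify are hypothetical, and it relies on the search pattern being matched case-insensitively, which is the Windows behaviour; on other platforms the lookup may be case-sensitive.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical outcome of checking one requested name against the file system.
public enum CaseCheck { NotFound, Redirect, Serve }

public static class StaticPathCanonicalizer
{
    // Asks the file system for an entry called 'requestedName' in 'directory'
    // and returns the name as it is stored on disk, or null if nothing matches.
    // Assumption: the platform matches the name case-insensitively (Windows).
    // Only a single segment without wildcard characters is handled here.
    public static string FindOnDiskName(string directory, string requestedName)
    {
        if (!Directory.Exists(directory))
        {
            return null;
        }

        return Directory.EnumerateFileSystemEntries(directory, requestedName)
                        .Select(p => Path.GetFileName(p))
                        .FirstOrDefault();
    }

    // No match: 404. Match with different casing: 301 to the on-disk name.
    // Exact (ordinal) match: serve as usual.
    public static CaseCheck Classify(string requestedName, string onDiskName)
    {
        if (onDiskName == null)
        {
            return CaseCheck.NotFound;
        }

        return string.Equals(requestedName, onDiskName, StringComparison.Ordinal)
            ? CaseCheck.Serve
            : CaseCheck.Redirect;
    }
}
```

Because the file system itself decides whether two names are equal, the code never has to pick a culture or guess at the $UpCase rules; it only does an ordinal comparison against whatever name came back.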

Edit:

In some ways, the question of domain names is more complicated. The rules for IDNs have to cover more issues with less room for maneuver. However, it is also simpler, at least as far as case canonicalization goes.

(I'm going to ignore the canonicalization of whether www. etc. is used or not. I would guess it is part of the same job here, but it stretches the scope, and we could write a book between us if we don't stop somewhere :)

IDN has its own case-canonicalization rules (and some other forms of normalization), defined in RFC 3491. If you are going to canonicalize domain names on case, follow that.

Makes for a nice, simple answer, doesn't it? :)

There is also less pressure on this front. With paths, search engines must recognize that http://example.net/thisisapath and http://example.net/ThisIsAPath may be the same resource, but they must also recognize that they may be different, and that is where all the SEO benefit of canonicalizing on one of them (no matter which) comes from.

However, they know that example.net and EXAMPLE.NET cannot be different sites, so there is little SEO advantage in making sure they are the same (though it is still good for things like caches and history lists that do not make that leap themselves). Of course, the problem remains that www.example.net or even maAndPasExampleEmporium.us may be one and the same site, but again, that moves away from the case issue.

Then there is the simple matter that most of the time we do not have to deal with more than a couple of dozen different domains, so sometimes working harder rather than smarter (i.e. just making sure they are all set up correctly, and not doing anything programmatically!) can do the trick.

Finally, it is important not to canonicalize third-party URIs. You can end up breaking things if you change the path (theirs may not be case-insensitive), and at the very least you may end up breaking their slightly different canonicalization. It is best to leave them as they are.
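For the host name part, .NET ships System.Globalization.IdnMapping, which applies the IDNA mapping rules when converting a Unicode host name to its ASCII (Punycode) form. A minimal sketch follows; the class and method names (IdnExample, ToAsciiHost) are hypothetical, and which revision of the IDNA rules is applied depends on the .NET version and operating system, so treat it as an illustration rather than a guaranteed implementation of RFC 3491.

```csharp
using System;
using System.Globalization;

public static class IdnExample
{
    // Converts a Unicode host name to its ASCII-compatible (Punycode) form.
    // The IDNA preparation steps (including case mapping of non-ASCII
    // characters) are done by IdnMapping, not by culture-sensitive ToLower.
    public static string ToAsciiHost(string unicodeHost)
    {
        var idn = new IdnMapping();
        return idn.GetAscii(unicodeHost);
    }
}
```

Usage: IdnExample.ToAsciiHost("b\u00FCcher.example") yields "xn--bcher-kva.example".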

+1

I would question the premise here that there is any usefulness in trying to automatically convert URLs to lowercase.

Whether a full URL is case-sensitive or not depends entirely on the web server, the web application framework, and the underlying file system.

You are only guaranteed case insensitivity in the scheme (http:// etc.) and the hostname of the URL. And remember that not all URL schemes (file and news, for example) even contain a hostname.

Everything else can be case-sensitive on the server, including paths (/), file names, queries (?), fragments (#) and authority information (usernames/passwords before the @ in mailto, http, ftp and some other schemes).
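As a small illustration of that split, System.Uri lowercases the scheme and host when it parses an absolute URL but leaves the case of the path, query and fragment alone. The helper name below (UrlCaseExample.LowercaseSchemeAndHostOnly) is hypothetical, and note that Uri may apply other normalizations as well (for example to percent-encoding or dot segments), so this is a sketch rather than a pure case-only transformation.

```csharp
using System;

public static class UrlCaseExample
{
    // Lets System.Uri normalize the scheme and host to lowercase while
    // keeping the case of the path, query and fragment as written.
    public static string LowercaseSchemeAndHostOnly(string url)
    {
        var uri = new Uri(url, UriKind.Absolute);
        return uri.AbsoluteUri;
    }
}
```

Usage: UrlCaseExample.LowercaseSchemeAndHostOnly("HTTP://Example.COM/Some/Path?Q=1#Frag") returns "http://example.com/Some/Path?Q=1#Frag".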

+4

First, never use case conversion to compare strings. It needlessly allocates a string, has a small and unnecessary performance cost, can result in a NullReferenceException if the value is null, and can produce an incorrect comparison.

If this is important enough to you, I would manually walk the file system and use my own comparison against each file/directory name. You should be able to use the Accept-Language or Accept-Encoding (if it carries a culture) HTTP header to find the appropriate culture to use. Once you have a CultureInfo, you can use it to perform string comparisons:

    var ci = CultureInfo.CurrentCulture; // Use Accept-Language to derive this.
    ci.CompareInfo.Compare("The URL", "the url", CompareOptions.IgnoreCase);

I would only do this on an HTTP 404: the 404 handler would look for a matching file and then send the user an HTTP 301 to the correct URL (since walking the file system manually can get expensive).

0