Translate a URL to a valid file name and back to a URL

I need to keep some information that is unique to every site my users visit. (It is actually a thumbnail of the site the user was looking at.)
This thumbnail (a JPEG file) must have a name indicating which site it represents so that it can be viewed later.

Can you recommend a simple translation from a URL to a valid file name and back?

Example: www.ibm.com could be mapped to www_ibm_com .

I'm not sure this would work for every valid URL, since some URLs have very complex query strings.

Is there a good regex or C# library that can be used?

Thanks in advance.

+4
2 answers

First, it is worth noting that "." is perfectly legal in file names but "/" is not, so while the example you gave does not require translation, "www.ibm.com/path1/file1.jpg" would.

A simple string.Replace would be the best solution here, provided you can find a character that is legal in file names but illegal in URLs.

Assuming the character that is illegal in URLs is "§" (it may in fact turn out to be legal in a URL), you would have:

 string fileName = url.Replace("/", "§");

to translate to a file name and:

 string url = fileName.Replace("§", "/");

to translate back.

This page on URL encoding defines the valid, invalid, and unsafe (valid, but with a special meaning) characters for URLs. The characters in the "upper half" of the ISO-Latin set, 80-FF hex (128-255 decimal), are not legal in URLs but may be fine in file names.

You will need to do this for every character in the URL that belongs to the set of invalid file name characters. You can get that set using Path.GetInvalidFileNameChars .
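Finding a separate substitute for every invalid character can get awkward, so here is a minimal sketch of one alternative: escape each invalid character as a marker plus its hex code. It is not the exact replacement scheme described above. The class name UrlFileName, the method names, and the choice of '%' as the escape marker are all assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: the names and the '%' marker are illustrative assumptions.
public static class UrlFileName
{
    private const char Escape = '%';

    // Replace every character that is invalid in a file name (and the
    // escape marker itself) with the marker plus four hex digits.
    public static string ToFileName(string url)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        var sb = new StringBuilder();
        foreach (char c in url)
        {
            if (c == Escape || Array.IndexOf(invalid, c) >= 0)
                sb.Append(Escape).Append(((int)c).ToString("X4"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Reverse the mapping: each marker is followed by exactly four hex digits.
    public static string ToUrl(string fileName)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < fileName.Length; i++)
        {
            if (fileName[i] == Escape)
            {
                sb.Append((char)Convert.ToInt32(fileName.Substring(i + 1, 4), 16));
                i += 4; // skip the hex digits just consumed
            }
            else
            {
                sb.Append(fileName[i]);
            }
        }
        return sb.ToString();
    }
}
```

Because the marker itself is escaped, a URL that already contains "%20" should still round-trip. The sketch does not handle file name length limits or reserved device names, and the set returned by Path.GetInvalidFileNameChars differs between operating systems, so names generated on one platform are not guaranteed to be valid on another.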

UPDATE

Assuming you cannot find matching character pairs, another solution would be to use a lookup table. One column holds the URL, the other the generated file name. As long as the generated name is unique (a GUID will do), you can do a two-way lookup to go from one to the other.
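A minimal in-memory sketch of such a lookup table, using two dictionaries so that the search works in both directions. The class name ThumbnailIndex, its method names, and the ".jpg" suffix are assumptions for illustration. A real application would also need to persist the table (for example in a database or a file), which is not shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical two-way lookup table between URLs and generated file names.
public sealed class ThumbnailIndex
{
    private readonly Dictionary<string, string> _fileByUrl = new Dictionary<string, string>();
    private readonly Dictionary<string, string> _urlByFile = new Dictionary<string, string>();

    // Return the file name already assigned to the URL, or generate a new
    // GUID-based name and record it in both directions.
    public string GetOrAddFileName(string url)
    {
        if (!_fileByUrl.TryGetValue(url, out string fileName))
        {
            fileName = Guid.NewGuid().ToString("N") + ".jpg";
            _fileByUrl[url] = fileName;
            _urlByFile[fileName] = url;
        }
        return fileName;
    }

    // Look up the URL that a generated file name stands for.
    public bool TryGetUrl(string fileName, out string url)
    {
        return _urlByFile.TryGetValue(fileName, out url);
    }
}
```

The trade-off is that the file name alone no longer tells you which site it represents, so the table has to be available whenever the thumbnails are viewed.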

+2

www.ibm.com is in fact already a valid file name. Slashes are more problematic, so if the URL contains subdirectories you will need to translate the slashes.

The main problem then is possible duplicates. For example, if slashes are replaced with underscores, both ibm.com/path1_path2 and ibm.com/path1/path2 will be translated to the same value.
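The collision is easy to reproduce. The short sketch below assumes the underscore substitution from the question and uses top-level statements (C# 9 or later):

```csharp
using System;

// Both URLs collapse to "ibm.com_path1_path2" once '/' becomes '_'.
string a = "ibm.com/path1_path2".Replace('/', '_');
string b = "ibm.com/path1/path2".Replace('/', '_');

Console.WriteLine(a == b); // prints "True"
```

Once that happens, the original URL can no longer be recovered from the file name.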

I like ChrisF's suggestion to find a character that is legal in file names but not in URLs, although I can't think of such a character off the top of my head.

If you can't find such a character, you may have to settle for one that is unlikely to appear in a URL.

+1