How can I safely encode a string in Java to use as a file name?

I get a string from an external process. I want to use this string to create a file name and then write to that file. Here is my code snippet to do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), s);
    PrintWriter currentWriter = new PrintWriter(currentFile);

If s contains an invalid character, such as "/" on a Unix-based OS, then java.io.FileNotFoundException is thrown (correctly).

How can I safely encode a string so that it can be used as a file name?

Edit: what I hope for is an API call that does this for me.

I could do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), URLEncoder.encode(s, "UTF-8"));
    PrintWriter currentWriter = new PrintWriter(currentFile);

But I'm not sure if URLEncoder is reliable for this purpose.

+83
java string file encoding
Jul 26 '09 at 9:54
10 answers

If you want the result to resemble the original string, SHA-1 or any other hashing scheme is not the answer. If collisions must be avoided, then simply replacing or deleting the "bad" characters is not the answer either.

Instead, you want something like this.

    char fileSep = '/'; // ... or do this portably.
    char escape = '%'; // ... or some other legal char.
    String s = ...
    int len = s.length();
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) {
        char ch = s.charAt(i);
        if (ch < ' ' || ch >= 0x7F || ch == fileSep || ... // add other illegal chars
            || (ch == '.' && i == 0) // we don't want to collide with "." or ".."!
            || ch == escape) {
            sb.append(escape);
            if (ch < 0x10) {
                sb.append('0');
            }
            sb.append(Integer.toHexString(ch));
        } else {
            sb.append(ch);
        }
    }
    File currentFile = new File(System.getProperty("user.home"), sb.toString());
    PrintWriter currentWriter = new PrintWriter(currentFile);

This solution provides reversible encoding (no collisions), where the encoded strings in most cases resemble the original strings. I assume you are using 8 bit characters.
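To illustrate the reversibility claim, here is a minimal sketch of an encoder together with a matching decoder. The class name `FileNameEscaper`, the helper `mustEscape`, and the exact set of escaped characters are assumptions made for the example; like the answer, it assumes 8-bit characters and rejects anything above 0xFF.

```java
final class FileNameEscaper {
    private static final char ESCAPE = '%';

    // Hypothetical rule set: control chars, non-ASCII, '/', the escape char
    // itself, and a leading '.'. Extend this for your target file system.
    private static boolean mustEscape(char ch, int index) {
        return ch < ' ' || ch >= 0x7F || ch == '/' || ch == ESCAPE
                || (ch == '.' && index == 0);
    }

    static String encode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch > 0xFF) {
                throw new IllegalArgumentException("only 8-bit characters are supported");
            }
            if (mustEscape(ch, i)) {
                // Always exactly two hex digits, so decoding is unambiguous.
                sb.append(ESCAPE).append(String.format("%02x", (int) ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    static String decode(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            if (ch == ESCAPE) {
                // The two characters after the escape char are the hex code.
                sb.append((char) Integer.parseInt(name.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

Because the escape character itself is escaped, `decode(encode(s))` returns `s` for any supported input. `decode` does not validate its argument, so a name that was not produced by `encode` may make it throw.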

URLEncoder works, but it has the disadvantage that it encodes a lot of characters that are perfectly legal in file names.

If you do not need the encoding to be guaranteed reversible, simply remove the "bad" characters instead of replacing them with escape sequences.

+10
Jul 26 '09 at 10:52

My suggestion is to take a whitelist approach: do not try to filter out bad characters; instead, define what is acceptable. You can then either reject the file name or filter it. If you want to filter it:

 String name = s.replaceAll("\\W+", ""); 

This replaces every run of characters that are not digits, letters, or underscores with nothing. Alternatively, you could replace them with another character (for example, an underscore).
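The reject and filter variants can be sketched side by side as follows. The class and method names are made up for the example. Note that in Java `\w` matches only `[a-zA-Z_0-9]` by default, so accented letters are treated as "bad" here.

```java
import java.util.regex.Pattern;

final class FileNameWhitelist {
    // One or more word characters and nothing else.
    private static final Pattern ALLOWED = Pattern.compile("\\w+");

    // Reject variant: accept the name only if every character is whitelisted.
    static boolean isAcceptable(String name) {
        return ALLOWED.matcher(name).matches();
    }

    // Filter variant: drop everything that is not whitelisted.
    static String filter(String name) {
        return name.replaceAll("\\W+", "");
    }

    // Filter variant that replaces each run of bad characters with an underscore.
    static String filterWithUnderscore(String name) {
        return name.replaceAll("\\W+", "_");
    }
}
```

With either filter the result can be empty or consist only of underscores (for example, for the input "///"), so the caller still needs a fallback name for that case.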

The problem is that if this is a shared directory, you do not want file name collisions. Even if storage areas are segregated by user, simply filtering out bad characters can still produce colliding file names. The name a user entered is often useful to them if they ever want to download the file again.

For this reason, I tend to let users enter whatever they want, store the file under a name based on a scheme of my choosing (for example, userId_fileId), and then store the user's file name in a database table. That way you can display it back to the user and store things however you want, without compromising security or overwriting other files.

You can also hash the file (for example, with MD5), but then you cannot list the files the user uploaded, at least not with meaningful names.
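A rough sketch of the userId_fileId idea, with an in-memory map standing in for the database table. The class `UploadRegistry`, its methods, and the simple counter used for file IDs are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class UploadRegistry {
    // Stand-in for the database table: storage name -> name the user entered.
    private final Map<String, String> displayNames = new HashMap<>();
    private long nextFileId = 1;

    // Returns the name to use on disk; the user's text never touches the file system.
    String register(long userId, String userSuppliedName) {
        String storageName = userId + "_" + nextFileId++;
        displayNames.put(storageName, userSuppliedName);
        return storageName;
    }

    // Looks up the original name for display or download.
    String displayName(String storageName) {
        return displayNames.get(storageName);
    }
}
```

Since the storage name is built only from numbers and an underscore, it is valid on any common file system regardless of what the user typed.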

EDIT: fixed regex for java

+81
Jul 26 '09 at 10:04

It depends on whether the encoding should be reversible or not.

Reversible

Use URL encoding (java.net.URLEncoder) to replace special characters with %xx. Note that you must take care of the special cases where the string equals ".", equals "..", or is empty!¹ Many programs use URL encoding to create file names, so this is a standard technique that everybody understands.

Irreversible

Use a hash (e.g. SHA-1) of the string. Modern hashing algorithms (not MD5) can be considered collision-free. In fact, if you find a collision, you will have made a breakthrough in cryptography.
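As a sketch of the irreversible variant using only the JDK, the following turns a string into a 40-character hexadecimal SHA-1 digest. The class and method names are made up for the example. SHA-1 is used only because the answer names it; passing "SHA-256" to `MessageDigest.getInstance` works the same way and yields a 64-character name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashedFileName {
    static String sha1Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes still print as two hex digits.
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

The output contains only the characters 0-9 and a-f, so none of the special cases for ".", "..", or the empty string can occur.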




¹ You can handle all 3 special cases elegantly by using a prefix such as "myApp-". If you put the file directly into $HOME, you will have to do that anyway to avoid conflicts with existing files such as ".bashrc".

    public static String encodeFilename(String s) {
        try {
            return "myApp-" + java.net.URLEncoder.encode(s, "UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 is an unknown encoding!?");
        }
    }
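For completeness, a sketch of the prefix approach with a matching decoder. The class name and the `decode` method are additions for this example and not part of the answer. One caveat worth knowing: URLEncoder leaves ".", "-", "*", and "_" unencoded and turns a space into "+", so the result can still contain "*", which Windows does not allow in file names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class PrefixedUrlFileName {
    private static final String PREFIX = "myApp-";

    static String encode(String s) {
        try {
            return PREFIX + URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required by the Java platform", e);
        }
    }

    static String decode(String fileName) {
        if (!fileName.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not created by encode(): " + fileName);
        }
        try {
            // Strip the prefix, then undo the URL encoding.
            return URLDecoder.decode(fileName.substring(PREFIX.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required by the Java platform", e);
        }
    }
}
```

With the prefix in place, the inputs ".", "..", and "" map to "myApp-.", "myApp-..", and "myApp-", all of which are ordinary file names.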
+25
Jul 26 '09 at 9:59

Here is what I use:

    public String sanitizeFilename(String inputName) {
        return inputName.replaceAll("[^a-zA-Z0-9-_\\.]", "_");
    }

This uses a regular expression to replace every character that is not a letter, digit, hyphen, underscore, or dot with an underscore.

This means that something like "How to convert £ to $" will become "How_to_convert___to__". Admittedly, this result is not very user-friendly, but it is safe, and the resulting directory/file names are guaranteed to work everywhere. In my case, the result is not shown to the user and is therefore not a problem, but you may want to change the regular expression to a more permissive one.

Another problem I ran into is that I sometimes get identical names (since they are based on user input), so you should be aware of this, as you cannot have multiple directories/files with the same name in one directory. In addition, you may need to truncate or otherwise shorten the resulting string, as it may exceed the 255-character limit that some systems have.
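One way to deal with both issues, sketched under a few assumptions: names already handed out are tracked in an in-memory set (a real application would check the target directory instead), the limit is 255 characters, and names are compared case-sensitively. The class `UniqueFileNames` and its method are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class UniqueFileNames {
    private static final int MAX_LENGTH = 255; // assumed limit
    private final Set<String> used = new HashSet<>();

    // Sanitizes the input, truncates it, and appends _1, _2, ... on a clash.
    String allocate(String input) {
        String base = input.replaceAll("[^a-zA-Z0-9_.-]", "_");
        String candidate = truncate(base, MAX_LENGTH);
        for (int n = 1; !used.add(candidate); n++) {
            String suffix = "_" + n;
            // Leave room for the suffix so the total stays within the limit.
            candidate = truncate(base, MAX_LENGTH - suffix.length()) + suffix;
        }
        return candidate;
    }

    private static String truncate(String s, int max) {
        return s.length() <= max ? s : s.substring(0, max);
    }
}
```

On a case-insensitive file system, "Report" and "report" would still clash; lowercasing the name before adding it to the set is one way to account for that.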

+12
Jan 10 '16 at 21:26

For those who are looking for a common solution, these may be general criteria:

  • The file name should resemble the original string.
  • If possible, the encoding should be reversible.
  • Collision probability should be minimized.

To do this, we can use a regular expression to match illegal characters, percent-encode them, and then limit the length of the encoded string.

    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-]");
    private static final int MAX_LENGTH = 127;

    public static String escapeStringAsFilename(String in) {
        StringBuffer sb = new StringBuffer();
        // Apply the regex.
        Matcher m = PATTERN.matcher(in);
        while (m.find()) {
            // Percent-encode each UTF-8 byte of the matched character as two hex
            // digits, so that URLDecoder.decode(..., "UTF-8") can reverse it.
            StringBuilder replacement = new StringBuilder();
            for (byte b : m.group().getBytes(StandardCharsets.UTF_8)) {
                replacement.append(String.format("%%%02X", b & 0xFF));
            }
            m.appendReplacement(sb, replacement.toString());
        }
        m.appendTail(sb);
        String encoded = sb.toString();
        // Truncate the string.
        int end = Math.min(encoded.length(), MAX_LENGTH);
        return encoded.substring(0, end);
    }

Patterns

The above pattern is based on a conservative subset of the allowed characters in the POSIX specification.

If you want to allow the dot character, use:

 private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-\\.]"); 

Just be careful with strings like "." and "..".

If you want to avoid collisions on case-insensitive file systems, you need to escape capital letters:

 private static final Pattern PATTERN = Pattern.compile("[^a-z0-9_\\-]"); 

Or escape lowercase letters:

 private static final Pattern PATTERN = Pattern.compile("[^A-Z0-9_\\-]"); 

Instead of using a whitelist, you can choose to blacklist the reserved characters of your specific file system. For example, this regular expression is suitable for FAT32 file systems:

 private static final Pattern PATTERN = Pattern.compile("[%\\.\"\\*/:<>\\?\\\\\\|\\+,\\.;=\\[\\]]"); 

Length

On Android, 127 characters is a safe limit. Many file systems accept 255 characters.

If you prefer to keep the tail rather than the head of your string, use:

    // Truncate the string.
    int start = Math.max(0, encoded.length() - MAX_LENGTH);
    return encoded.substring(start, encoded.length());

Decoding

To convert the file name back to the original string, use:

 URLDecoder.decode(filename, "UTF-8"); 

Limitations

Since longer strings are truncated, there is the possibility of name collision during encoding or corruption during decoding.
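One source of decoding corruption is a cut that lands in the middle of a %XX escape. The following sketch shows a head truncation that backs up to the start of such an escape. It assumes that every '%' in the encoded string begins a three-character escape, which holds when '%' itself is always escaped; the class and method names are made up for the example.

```java
final class SafeTruncation {
    static String truncate(String encoded, int maxLength) {
        if (encoded.length() <= maxLength) {
            return encoded;
        }
        int end = maxLength;
        // An escape occupies three characters; if a '%' sits one or two
        // positions before the cut, the cut would split that escape.
        for (int back = 1; back <= 2; back++) {
            int pos = end - back;
            if (pos >= 0 && encoded.charAt(pos) == '%') {
                end = pos;
                break;
            }
        }
        return encoded.substring(0, end);
    }
}
```

This only keeps individual escapes intact. A character that was encoded as several consecutive escapes can still be cut between them, and truncated names can still collide.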

+11
Feb 23 '14 at 23:25

Pick your poison from the options offered by Commons Codec, for example:

 String safeFileName = DigestUtils.sha1Hex(filename); 
+4
Jul 26 '13 at 19:26

Try the following regular expression, which replaces every invalid file name character with a space:

    public static String toValidFileName(String input) {
        return input.replaceAll("[:\\\\/*\"?|<>']", " ");
    }
+4
Feb 01 '15 at 23:26

This is probably not the most efficient way, but shows how to do it using Java 8 pipelines:

    private static String sanitizeFileName(String name) {
        return name
                .chars()
                .mapToObj(i -> (char) i)
                .map(c -> Character.isWhitespace(c) ? '_' : c)
                .filter(c -> Character.isLetterOrDigit(c) || c == '-' || c == '_')
                .map(String::valueOf)
                .collect(Collectors.joining());
    }

The solution can be improved by creating a custom collector that uses a StringBuilder, so you don't have to convert each lightweight char into a heavyweight String.
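One possible shape of that improvement, sketched here with the three-argument `IntStream.collect` rather than a full `Collector` implementation. It works on code points instead of chars, which is a small departure from the code above, and the class name is made up for the example.

```java
final class StreamSanitizer {
    static String sanitizeFileName(String name) {
        return name.codePoints()
                // Whitespace becomes an underscore.
                .map(cp -> Character.isWhitespace(cp) ? '_' : cp)
                // Keep letters, digits, '-' and '_'; drop everything else.
                .filter(cp -> Character.isLetterOrDigit(cp) || cp == '-' || cp == '_')
                // Accumulate directly into a StringBuilder; no per-character Strings.
                .collect(StringBuilder::new,
                        StringBuilder::appendCodePoint,
                        StringBuilder::append)
                .toString();
    }
}
```

For example, `sanitizeFileName("my file: v1.txt")` yields "my_file_v1txt".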

+1
Jul 13 '15 at 11:43

You can remove the invalid characters ('/', '\', '?', '*') and then use the string.

0
Jul 26 '09 at 9:58

Just use:

IOHelper.toFileSystemSafeName ("Iblabla / blabla");

will turn into "Iblablablabla"

-1
Jun 23 '13 at 16:54


