What character sequences should I not allow in a file name?

After some testing I found that Linux allows any character in a file name except / and null ( \0 ). So which sequences should I not allow in a file name? I heard that a leading dash may confuse some command-line programs. That doesn't matter to me, but it may bother other people if they collect a bunch of files and filter them with some GNU programs.

It was suggested that I strip leading and trailing spaces, and I plan to, but only because users usually do not mean to type a leading or trailing space.

Which sequences are likely to be problematic, and which should I consider disallowing? I am also considering disallowing the characters that are illegal in Windows, just for convenience. I think I may disallow dashes at the beginning (a dash is a legal character in Windows).

+19
command-line linux filenames
Feb 20 '10 at 23:46
6 answers

Your question is somewhat confusing: you talk in detail about Linux, but in a comment on another answer you say that you are generating file names for people to download. That presumably means you have no control over the operating system or file system the files will be saved on, which makes Linux irrelevant.

For the purposes of this answer, I am going to assume that your question is incorrect and your comment is correct.

The vast majority of operating systems and file systems in use today fall into approximately three categories: POSIX, Windows, and MacOS.

The POSIX specification is very clear about what a file name looks like if it is to be portable across all POSIX systems. The characters you can use are defined in Section 3.276 (Portable Filename Character Set) of the Open Group Base Specifications as:

    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    abcdefghijklmnopqrstuvwxyz
    0123456789._-

The maximum file name length you can rely on is defined in Section 13.23.3.5 ( <limits.h> Minimum Values) as 14. (The corresponding constant is _POSIX_NAME_MAX .)

So a file name that is up to 14 characters long and contains only the 65 characters listed above is safe to use on all POSIX-compliant systems, which gives you 24407335764928225040435790 combinations (or about 84 bits).
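The count above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming every length from 1 to 14 is counted; the constant names are made up for illustration.

```python
# Count every name of length 1..14 over the 65-character portable set.
ALPHABET_SIZE = 65  # 26 upper case + 26 lower case + 10 digits + '.', '_', '-'
MAX_LENGTH = 14     # value of _POSIX_NAME_MAX

total = sum(ALPHABET_SIZE ** length for length in range(1, MAX_LENGTH + 1))

print(total)                   # number of distinct portable names
print(total.bit_length() - 1)  # whole bits of information (floor of log2)
```

The later figures in this answer come from the same kind of sum with a smaller alphabet and special cases for the first and last characters.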

If you do not want to annoy your users, you should add two more restrictions: do not start the file name with a dash or a dot. File names starting with a dot are usually interpreted as "hidden" files and will not appear in directory listings unless explicitly requested. And file names starting with a dash may be interpreted as an option by many commands. (Sidenote: it's amazing how many users don't know about the rm ./-rf or rm -- -rf tricks.)

This leaves you with 23656340818315048885345458 combinations (still about 84 bits).
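Taken together, the rules so far (portable character set, at most 14 characters, no leading dash or dot) fit in one regular expression. The helper name is_portable_name is hypothetical, and this is a minimal sketch rather than a complete validator.

```python
import re

# First character: letter, digit or underscore (no leading '-' or '.').
# Then up to 13 more characters from the portable set, 14 in total.
_PORTABLE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9._-]{0,13}")

def is_portable_name(name: str) -> bool:
    """Return True if name passes the POSIX-portable rules described above."""
    return _PORTABLE.fullmatch(name) is not None

print(is_portable_name("notes.txt"))  # True
print(is_portable_name("-rf"))        # False: leading dash
print(is_portable_name(".hidden"))    # False: leading dot
```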

Windows adds a couple of new restrictions to this: file names cannot end with a period, and file names are not case sensitive. This reduces the character set from 65 to 39 characters (37 for the first character, 38 for the last). It does not add any length restriction; Windows handles 14 characters just fine.

This reduces the possible combinations to 17866587696996781449603 (73 bits).

Another limitation is that Windows treats everything after the last dot as a file name extension that denotes the file type. If you want to avoid potential confusion (say, if you create a file name like abc.mp3 for a text file), you should avoid dots altogether.

You still have 13090925539866773438463 combinations (73 bits).

If you need to worry about DOS, then additional restrictions apply: the file name consists of one or two parts (separated by a dot), neither of which can contain a dot. The first part has a maximum length of 8 characters, the second a maximum of 3. Again, the second part is usually reserved for indicating the file type, which leaves you with only 8 characters.

You now have 4347792138495 possible file names or 41 bits.

The good news is that you can use the 3-character extension to actually indicate the file type correctly without violating the POSIX file name limit (8 + 3 + 1 = 12, which is less than 14).
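The 8.3 layout described above can be checked in the same way. The sketch below assumes only the characters this answer has kept so far (letters, digits, underscore, hyphen, no leading hyphen); real DOS allows a few more symbols, and is_dos_83 is a hypothetical name.

```python
import re

# Base name: 1-8 characters, not starting with '-'.
# Optional extension: a dot followed by 1-3 characters.
# re.ASCII keeps IGNORECASE from matching non-ASCII look-alikes.
_DOS_83 = re.compile(
    r"[A-Z0-9_][A-Z0-9_-]{0,7}(\.[A-Z0-9_-]{1,3})?",
    re.IGNORECASE | re.ASCII,
)

def is_dos_83(name: str) -> bool:
    """Return True if name fits the restricted 8.3 pattern sketched here."""
    return _DOS_83.fullmatch(name) is not None

print(is_dos_83("README.TXT"))        # True
print(is_dos_83("longfilename.txt"))  # False: base part longer than 8
print(is_dos_83("a.b.c"))             # False: more than one dot
```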

If you want your users to be able to write the files to CD-Rs formatted in accordance with ISO9660 Level 1, you need to disallow the hyphen anywhere, not just as the first character. Now the remaining character set looks like

    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    0123456789_

which gives you 3512479453921 combinations (41 bits).
+67
Feb 21 2018-10-21

I would leave the definition of "valid" to the OS and the file system driver. Let the user enter whatever they want and pass it on. Handle errors from the OS accordingly. The one exception is that I consider it reasonable to strip leading and trailing spaces. If people want to create file names with embedded spaces, leading dashes or question marks, and their chosen file system allows it, you should not try to prevent them.

It is possible to mount different file systems at different mount points (or drives in Windows), each with different rules about which characters are legal in a file name. Handling this kind of thing inside your application will be much more work than necessary, because the OS already does it for you.
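A minimal sketch of this "let the OS decide" approach, assuming Python and a hypothetical helper called try_create: strip the surrounding spaces, attempt the create, and report whatever error comes back.

```python
import os
import tempfile

def try_create(directory: str, name: str) -> bool:
    """Strip surrounding spaces, then let the OS accept or reject the name."""
    name = name.strip(" ")
    try:
        # "x" mode creates the file and refuses to overwrite an existing one.
        with open(os.path.join(directory, name), "x"):
            pass
    except (OSError, ValueError) as exc:
        # OSError: rejected by the OS or file system.
        # ValueError: Python itself rejects an embedded NUL character.
        print(f"cannot create {name!r}: {exc}")
        return False
    return True

with tempfile.TemporaryDirectory() as demo_dir:
    print(try_create(demo_dir, "  leading and trailing spaces  "))
    print(try_create(demo_dir, "bad\0name"))
```

Note that this sketch does nothing about path separators or absolute paths in name; an application that takes names from untrusted users would still have to reject those itself.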

+6
Feb 20 '10 at 23:51

Since you seem to be primarily interested in Linux, one thing to avoid is characters that a (typical) shell will try to interpret, for example as a wildcard. You can create a file called "*" if you insist, but some of your users may not really appreciate it.
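To illustrate, here is a rough Python sketch that flags names containing characters a typical POSIX shell treats specially, and shows how shlex.quote from the standard library makes such a name safe to paste into a command line. The character list is an assumption and is not exhaustive; has_shell_special is a hypothetical helper.

```python
import shlex

# Characters a typical POSIX shell interprets when they appear unquoted.
# This list is illustrative, not exhaustive.
SHELL_SPECIAL = set("*?[]{}()<>|&;$`\"'\\!~# \t\n")

def has_shell_special(name: str) -> bool:
    """Return True if name contains a character the shell may interpret."""
    return any(ch in SHELL_SPECIAL for ch in name)

print(has_shell_special("*"))          # True
print(has_shell_special("notes.txt"))  # False
print(shlex.quote("*"))                # '*'  (wrapped in single quotes)
```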

+5
Feb 20 '10 at

Are you developing an application in which users are asked to name the files themselves? If so, you can set the rules in your application (for example, allow [a-zA-Z0-9_.] and reject all other special characters). This is much easier to enforce.

+3
Feb 21 '10 at 0:03

urlencode all the strings that will be used as file names, and then you only need to worry about the length. This answer may be worth a read.
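In Python, for example, this could look like the sketch below, which uses urllib.parse.quote from the standard library; encode_name is a hypothetical wrapper.

```python
from urllib.parse import quote, unquote

def encode_name(raw: str) -> str:
    """Percent-encode raw so it can be used as a single file name."""
    # safe="" also escapes "/", which quote() leaves alone by default.
    return quote(raw, safe="")

encoded = encode_name("a/b c*?")
print(encoded)           # a%2Fb%20c%2A%3F
print(unquote(encoded))  # a/b c*?
```

Two caveats: the encoded name can be up to three times as long as the original (more for non-ASCII text, which is encoded as UTF-8 bytes), and letters, digits and the characters _ . - ~ pass through unchanged, so a leading dash or dot survives encoding.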

0
Feb 20 '10 at 23:57

I would recommend using a whitelist of characters. In general, symbols in file names will annoy people.

By all means let people use a-z, 0-9 and Unicode characters above 0x80, but do not allow arbitrary symbols (a comma, for example); they will cause a lot of irritation, as will full stops in inappropriate places.

I think the ASCII symbols that can be safely allowed are: full stop, underscore, hyphen.

Allowing any OTHER ASCII symbols in the file name is asking for trouble.

The file name should also not begin with an ASCII symbol. The policy on spaces in file names is tricky, as users may rely on being able to use them, but some file names are obviously silly (such as ones that START with a space).

0
Feb 21 '10 at 11:41