Directory recursion and symbolic links

If you walk a directory tree recursively using the obvious method, you will run into problems with infinite recursion when a symbolic link points to the parent directory.

The obvious solution would be to check for symbolic links and not follow them at all. But that can be an unpleasant surprise for a user who does not expect something that otherwise behaves like a perfectly normal directory to be silently ignored.

An alternative solution would be to keep a hash table of all directories visited so far, and use it to check for loops. But this requires some canonical representation, some way of getting the identity, of the directory you are currently looking at (regardless of the path by which you reached it).

Would Unix users typically find the second solution less surprising?

If so, is there a way to get such a canonical representation / directory identity that is portable across Unix systems? (I would like it to work across Linux, BSD, Mac OS, Solaris, etc. I expect to have to write separate code for Windows.)

+7
5 answers

An often-overlooked API in this area is nftw().

nftw() has an option (FTW_PHYS) to avoid following symbolic links, and it offers much more than that. Here is a simple example from the man page itself:

    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    static int
    display_info(const char *fpath, const struct stat *sb,
                 int tflag, struct FTW *ftwbuf)
    {
        printf("%-3s %2d %7jd %-40s %d %s\n",
               (tflag == FTW_D)   ? "d"   :
               (tflag == FTW_DNR) ? "dnr" :
               (tflag == FTW_DP)  ? "dp"  :
               (tflag == FTW_F)   ? "f"   :
               (tflag == FTW_NS)  ? "ns"  :
               (tflag == FTW_SL)  ? "sl"  :
               (tflag == FTW_SLN) ? "sln" : "???",
               ftwbuf->level, (intmax_t) sb->st_size,
               fpath, ftwbuf->base, fpath + ftwbuf->base);
        return 0;           /* To tell nftw() to continue */
    }

    int
    main(int argc, char *argv[])
    {
        int flags = 0;

        if (argc > 2 && strchr(argv[2], 'd') != NULL)
            flags |= FTW_DEPTH;
        if (argc > 2 && strchr(argv[2], 'p') != NULL)
            flags |= FTW_PHYS;

        if (nftw((argc < 2) ? "." : argv[1], display_info, 20, flags) == -1) {
            perror("nftw");
            exit(EXIT_FAILURE);
        }
        exit(EXIT_SUCCESS);
    }


+4

The absolute path of the directory, with all symbolic links resolved, is such a representation. You can get it with the realpath function, which is defined in the POSIX standard, so it will work on any POSIX-compliant system. See man 3 realpath.
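As a rough illustration of using the resolved path as an identity, the sketch below compares two paths after passing both through realpath. The function name same_canonical_path is made up for this example, and passing NULL as the second argument assumes a POSIX.1-2008 system, where realpath allocates the result itself. Note that this approach does not recognize hard links, since two hard-linked names resolve to different strings.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if both paths resolve to the same
 * canonical path, 0 if they differ, and -1 if either path cannot be
 * resolved (errno is set by realpath in that case). */
int same_canonical_path(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    /* With a NULL buffer, realpath() mallocs the result (POSIX.1-2008). */
    char *ra = realpath(a, NULL);
    if (ra == NULL)
        return -1;

    char *rb = realpath(b, NULL);
    if (rb == NULL) {
        free(ra);
        return -1;
    }

    int same = (strcmp(ra, rb) == 0);
    free(ra);
    free(rb);
    return same;
}
```

In a directory walker, the resolved string itself could serve as the hash-table key instead of being compared pairwise as it is here.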

+3

Not only symbolic links, but hard links too. Hard links to directories are not very common, but they are not forbidden either (where they are allowed at all, only root can create them). The only thing that is canonical is the {device_number, inode_number} pair. But network file systems may behave badly.
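One way the {device_number, inode_number} pair could be used for loop detection is sketched below. All names (dir_id, visited_set, visited_add, visited_free) are invented for this example, and a growable array with a linear search stands in for the hash table from the question to keep the code short. It is an illustration under those assumptions, not a tuned implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Identity of a directory: the (device, inode) pair from struct stat. */
struct dir_id {
    dev_t dev;
    ino_t ino;
};

/* Growable array of identities seen so far (hypothetical type). */
struct visited_set {
    struct dir_id *items;
    size_t len;
    size_t cap;
};

/* Records the directory at `path`.
 * Returns 1 if it was newly added, 0 if it had already been visited
 * (a loop, or a second route to the same directory), and -1 if stat()
 * or memory allocation failed. */
int visited_add(struct visited_set *set, const char *path)
{
    struct stat sb;

    assert(set != NULL && path != NULL);
    if (stat(path, &sb) == -1)      /* stat() follows symbolic links */
        return -1;

    /* Linear scan; a real walker would likely hash the pair instead. */
    for (size_t i = 0; i < set->len; i++)
        if (set->items[i].dev == sb.st_dev && set->items[i].ino == sb.st_ino)
            return 0;

    if (set->len == set->cap) {
        size_t new_cap = set->cap ? set->cap * 2 : 16;
        struct dir_id *p = realloc(set->items, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        set->items = p;
        set->cap = new_cap;
    }
    set->items[set->len].dev = sb.st_dev;
    set->items[set->len].ino = sb.st_ino;
    set->len++;
    return 1;
}

/* Releases the array and resets the set to its empty state. */
void visited_free(struct visited_set *set)
{
    assert(set != NULL);
    free(set->items);
    set->items = NULL;
    set->len = 0;
    set->cap = 0;
}
```

A walker would call visited_add on each directory before descending into it and skip the directory whenever the result is 0.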

+2

This problem of identifying identical files has to be solved by many applications, for example duplicate-file checkers (identical contents, different names) and utilities that act on entire directory hierarchies, such as tar.

A good implementation will want to avoid false positives for hard-linked and symlinked files, whether the symbolic links point to parent directories or to files.

The most portable approach to this problem is to identify files using the POSIX stat / fstat functions and the st_dev and st_ino members of the struct stat that they fill in. A real implementation of duplicate-file checking in C using this strategy is samefile (another implementation of which was a 1998 IOCCC winning entry :-)

+2

Since you did not specify which language you are working in (if any), let's start with the shell: if you are on a system with GNU readlink, just use readlink -f <path> for canonicalization.

If you are on a Mac (which has a non-GNU readlink that behaves differently), see "How can I get the behavior of GNU's readlink -f on a Mac?" to accomplish the same task.

Another option is to use inode IDs to track unique files (via stat or similar), but that still requires resolving all symbolic links first (since symbolic links themselves have their own unique inode IDs), and the easiest way to follow all symbolic links is, well, readlink.

Alternatively, many programming languages have bindings to the POSIX realpath function, which does essentially the same thing as readlink -f (but as a library call). For example, Python has os.path.realpath(), C has it as a function declared in stdlib.h, etc.

If you already work in a language with such a function, it is recommended to use it, since you often get cross-platform compatibility for free (provided that your language is cross-platform).

+1