How to get directory listing FAST in Java?

Assume a very simple program that lists all the subdirectories of a given directory. Sounds simple enough? As far as I know, the only way to list all subdirectories in Java is to use a FilenameFilter combined with File.list().

This works for the trivial case, but when the folder contains, say, 150,000 files and 2 subfolders, it stupidly sits there for 45 seconds, iterating over all the files and testing file.isDirectory(). Is there a better way to list the subdirectories?




PS. Sorry, please spare me the lecture about having too many files in one directory. Our live environment has this as part of a broader requirement.

+18
java performance filesystems file-io
Jun 23 '09 at 20:26
13 answers

As already mentioned, this is basically a hardware problem. Disk access is always slow, and most file systems are not designed to handle directories with that many files.

If for some reason you have to keep all the files in one directory, I think you will need to maintain your own cache. This could be done with a local database such as SQLite or HSQLDB. If you want maximum performance, use a Java TreeSet and hold it in memory. That means, at the very least, that you will have to read the directory less often, and the reading can be done in the background. You could reduce the need to refresh the list even further by using your platform's file-update notification API (inotify on Linux) to subscribe to directory changes.
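A minimal sketch of such a cache, using the WatchService API that later shipped with Java 7 NIO.2 (it is backed by inotify on Linux); the class name and the single-directory scope are my own assumptions:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Set;
import java.util.TreeSet;

class CachedDirListing {
    // In-memory cache of one directory's entry names, kept fresh by
    // draining WatchService events instead of re-listing the directory.
    private final Set<String> names = new TreeSet<>();
    private final WatchService watcher;

    CachedDirListing(Path dir) throws IOException {
        watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                              StandardWatchEventKinds.ENTRY_DELETE);
        // One full (slow) scan up front; afterwards we only apply deltas.
        try (DirectoryStream<Path> s = Files.newDirectoryStream(dir)) {
            for (Path p : s) {
                names.add(p.getFileName().toString());
            }
        }
    }

    // Applies any pending create/delete events, then returns the cached view.
    Set<String> entries() {
        WatchKey key;
        while ((key = watcher.poll()) != null) {
            for (WatchEvent<?> ev : key.pollEvents()) {
                String name = ev.context().toString();
                if (ev.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                    names.add(name);
                } else if (ev.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                    names.remove(name);
                }
            }
            key.reset(); // re-arm the key so further events keep arriving
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(new CachedDirListing(dir).entries());
    }
}
```

Event delivery is asynchronous, so the cache can lag slightly behind the real directory; for this use case (avoiding a 45-second scan) that is usually an acceptable trade.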

This may not seem feasible to you, but I once solved a similar problem by "hashing" the files into subdirectories. In my case, the task was to store several million images with numeric identifiers. I built the directory structure as follows:

images/[id - (id % 1000000)]/[id - (id % 1000)]/[id].jpg 

This worked well for us, and it is the solution I would recommend. You could do something similar for alphanumeric file names by taking the first two letters of the file name, and then the next two letters. I did that once too, and it worked as well.
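A sketch of the numeric scheme above; the helper name is made up, and the bucket sizes simply mirror the path template:

```java
class ImagePathHasher {
    // Maps a numeric image id into a two-level directory tree so that
    // each leaf directory holds at most 1000 files.
    static String pathFor(long id) {
        long millions = id - (id % 1_000_000); // top-level bucket
        long thousands = id - (id % 1_000);    // second-level bucket
        return "images/" + millions + "/" + thousands + "/" + id + ".jpg";
    }

    public static void main(String[] args) {
        System.out.println(pathFor(15023));   // images/0/15000/15023.jpg
        System.out.println(pathFor(2345678)); // images/2000000/2345000/2345678.jpg
    }
}
```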

+10
Jun 23 '09 at 21:20

Do you know the finite list of possible subdirectory names in advance? If so, loop over all the possible names and check whether each directory exists.

Otherwise, you cannot get ONLY the directory names in most underlying operating systems (e.g. on Unix, a directory listing is just reading the contents of the "directory" file, so there is no way to quickly find "just the directories" without listing all the files).

However, NIO.2 in Java 7 (see http://java.sun.com/developer/technicalArticles/javase/nio/#3 ) has a way to get a directory listing as a stream, so you don't get a full array of file elements cluttering your memory/network.
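A sketch of that streaming style as it eventually landed in Java 7 (`Files.newDirectoryStream` with a filter; the helper name and the sorting are my additions):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class StreamedSubdirs {
    // Entries are pulled from the DirectoryStream one at a time; the
    // filter skips non-directories as they are read, so no 150k-element
    // array is ever materialized.
    static List<String> subdirectoryNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, Files::isDirectory)) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        }
        Collections.sort(names); // deterministic order for display
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(subdirectoryNames(dir));
    }
}
```

Note the filter still stats each entry, so the win here is memory and latency to the first result, not fewer disk accesses.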

+7
Jun 23 '09 at 20:46

Actually, there is a reason you got the lectures: it is the correct answer to your problem. Here is the background, so that maybe you can make some changes to your live environment.

First: directories are stored on the file system; think of them as files, because that is exactly what they are. As you iterate through a directory, you have to read those blocks from disk. Each directory entry needs enough space to hold the file name and permissions, plus information on where that file is found on disk.

Second: directories are not stored with any internal ordering (at least not in the file systems where I have worked with directory files). If you have 150,000 entries and 2 subdirectories, those 2 subdirectory references could be anywhere within the 150,000. You have to iterate to find them; there is no way around it.

So, let's say you cannot avoid the large directory. Your only real option is to try to keep the blocks holding the directory file in the in-memory cache, so that you are not hitting the disk every time you access them. You could achieve that by regularly iterating over the directory in a background thread, but this will put undue load on your disks and interfere with other processes. Alternatively, you could scan once and keep track of the results.

The alternative is to create a tiered directory structure. If you look at commercial websites, you will see URLs like /1/150/15023.html; this is meant to keep the number of files per directory small. Think of it as a BTree index in a database.

Of course, you can hide that structure: you can create a filesystem abstraction layer that takes flat file names and automatically generates the directory tree where those files can be found.
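A sketch of such an abstraction layer, assuming a two-letters-per-level scheme (all names here are hypothetical):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class FlatNameMapper {
    // Callers keep using flat file names; the mapper derives a two-level
    // directory tree from the first four characters of the name.
    static Path resolve(Path root, String fileName) {
        String level1 = fileName.substring(0, 2);
        String level2 = fileName.substring(2, 4);
        return root.resolve(level1).resolve(level2).resolve(fileName);
    }

    public static void main(String[] args) {
        System.out.println(resolve(Paths.get("store"), "abcde.txt"));
        // e.g. store/ab/cd/abcde.txt (separator depends on the platform)
    }
}
```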

+5
Jun 23 '09 at 20:44

I don’t know if the overhead of shelling out to cmd.exe would eat it up, but one possibility would be something like this:

```java
// ...
Runtime r = Runtime.getRuntime();
Process p = r.exec("cmd.exe /k dir /s/b/ad C:\\folder");
BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
for (;;) {
    String d = br.readLine();
    if (d == null) {
        break;
    }
    System.out.println(d);
}
// ...
```
  • /s searches subdirectories recursively
  • /ad returns directories only
  • /b uses bare format (full paths, no headers)
+4
Jun 23 '09 at 20:59

You could hack it if all 150k files (or a significant number of them) followed a similar naming convention, like:

 *.jpg *Out.txt 

and only actually create File objects for the ones you are not sure are folders.
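A sketch of that idea with a FilenameFilter; the two patterns are just the examples above:

```java
import java.io.File;
import java.io.FilenameFilter;

class KnownPatternFilter {
    // Names matching known file conventions are rejected up front, so
    // only the leftovers ever pay for an isDirectory() call.
    static final FilenameFilter POSSIBLE_DIRS = (dir, name) ->
            !(name.endsWith(".jpg") || name.endsWith("Out.txt"));

    public static void main(String[] args) {
        File parent = new File(args.length > 0 ? args[0] : ".");
        String[] candidates = parent.list(POSSIBLE_DIRS);
        if (candidates != null) {
            for (String name : candidates) {
                if (new File(parent, name).isDirectory()) {
                    System.out.println(name);
                }
            }
        }
    }
}
```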

+3
Jun 23 '09 at 20:31

The key issue may be the File.isDirectory() function being called in a loop.

File.isDirectory() can be extremely slow. I have seen NFS take 10 seconds to process a directory of 200 files.

If you can by any means avoid calls to File.isDirectory() (e.g. test for a file extension; no extension == directory), you could improve performance drastically.
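A sketch of the extension heuristic; it only holds under the stated assumption that directories in this tree never carry an extension:

```java
import java.io.File;

class NoExtensionHeuristic {
    // Pure string test: no dot in the name means "treat it as a
    // directory". This replaces a per-entry isDirectory() round trip.
    static boolean looksLikeDirectory(String name) {
        return name.indexOf('.') < 0;
    }

    public static void main(String[] args) {
        File parent = new File(args.length > 0 ? args[0] : ".");
        String[] names = parent.list();
        if (names != null) {
            for (String name : names) {
                if (looksLikeDirectory(name)) {
                    System.out.println(name);
                }
            }
        }
    }
}
```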

Otherwise, I would suggest using JNA/JNI, or writing a native script that does it for you.

The jCIFS library lets you work with Windows network shares more efficiently. I am not aware of a library that does the same for other network file systems.

+3
Nov 06 '09 at 15:57

If your OS is "stable", give JNA a try:

These are all "streaming APIs". They do not force you to allocate a 150k-element list/array before the search begins. IMHO that is a big advantage in your scenario.

+2
Jun 23 '09 at 21:08

There is also a recursive parallel scanner at http://blogs.oracle.com/adventures/entry/fast_directory_scanning . Essentially, sibling directories are traversed in parallel. There are some encouraging performance tests there as well.
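In the spirit of that article, a fork/join sketch (hypothetical class; it counts files while handing each sibling subdirectory to its own task):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class ParallelScan extends RecursiveTask<Long> {
    private final File dir;

    ParallelScan(File dir) { this.dir = dir; }

    @Override
    protected Long compute() {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return 0L; // not a directory, or an I/O error
        }
        long count = 0;
        List<ParallelScan> subtasks = new ArrayList<>();
        for (File f : entries) {
            if (f.isDirectory()) {
                ParallelScan task = new ParallelScan(f);
                task.fork(); // sibling directories scan in parallel
                subtasks.add(task);
            } else {
                count++;
            }
        }
        for (ParallelScan task : subtasks) {
            count += task.join();
        }
        return count;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println(new ForkJoinPool().invoke(new ParallelScan(root)));
    }
}
```

Parallelism helps most when siblings live on different disks or the directory blocks are already cached; a single spinning disk can actually get slower under concurrent seeks.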

+1
Jun 23 '09 at 21:28

This is an off-the-wall solution, and completely untested. It also depends on having a file system that supports symbolic links. And it is not a Java solution: I suspect your problem is file-system/OS related, not Java related.

Is it possible to create a parallel directory structure, with subdirectories based on the initial letters of the file names, and then symbolically link to the real files? An illustration:

 /symlinks/a/b/cde 

will refer to

 /realfiles/abcde 

(where /realfiles is where your 150,000 files live)

You would need to create and maintain this directory structure, and I do not have enough information to determine whether that is practical. But the above would create a fast(er) index into your non-hierarchical (and slow) directory.
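A sketch of building one entry of that index, using the NIO.2 API that shipped after this answer (method name hypothetical; requires symlink support and, on Windows, the right privileges):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SymlinkIndex {
    // Creates /symlinks/a/b/cde -> /realfiles/abcde for a file named "abcde".
    static Path indexFile(Path symlinkRoot, Path realFile) throws IOException {
        String name = realFile.getFileName().toString();
        Path linkDir = symlinkRoot
                .resolve(name.substring(0, 1))
                .resolve(name.substring(1, 2));
        Files.createDirectories(linkDir);
        return Files.createSymbolicLink(linkDir.resolve(name.substring(2)), realFile);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("symlink-demo");
        Path real = Files.createDirectories(tmp.resolve("realfiles")).resolve("abcde");
        Files.write(real, "demo".getBytes());
        System.out.println(indexFile(tmp.resolve("symlinks"), real));
    }
}
```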

+1
Jun 23 '09 at 21:31

Perhaps you could write a directory-searching program in C#/C/C++ and use JNI to call it from Java. I do not know whether this would improve performance or not.

0
Jun 23 '09 at 20:37

In this case, you might try some JNA solution: a platform-dependent directory traverser (FindFirst, FindNext on Windows) with the possibility of some iteration pattern. Also, Java 7 will have much better file system support; it is worth checking the specs (I do not remember any specifics).

Edit: An idea: one option is to hide the slowness of the directory listing from the user's eyes. In a client-side application, you could show some animation while the listing is working, to distract the user. Actually, it depends on what else your application does besides the listing.

0
Jun 23 '09 at 20:38

Well, either JNI, or, if you say your deployment is constant, just run "dir" on Windows or "ls" on *nixes, with the appropriate flags to list only directories (Runtime.exec()).

0
Jun 23 '09 at 20:44

I came across a similar question when debugging performance in a Java application enumerating plenty of files. It was using the old approach:

```java
for (File f : new File("C:\\").listFiles()) {
    if (f.isDirectory()) {
        continue;
    }
}
```

And it turned out that every f.isDirectory() is a call into the native FileSystem which, at least on NTFS, is very slow. Java 7 NIO has an additional API, but not all of its methods perform well. Here is a JMH benchmark result:

```
Benchmark                  Mode  Cnt  Score   Error  Units
MyBenchmark.dir_listFiles  avgt    5  0.437 ± 0.064   s/op
MyBenchmark.path_find      avgt    5  0.046 ± 0.001   s/op
MyBenchmark.path_walkTree  avgt    5  1.702 ± 0.047   s/op
```

The numbers come from running this code:

```
java -jar target/benchmarks.jar -bm avgt -f 1 -wi 5 -i 5 -t 1
```

```java
static final String testDir = "C:/Sdk/Ide/NetBeans/src/dev/src/";
static final int nCycles = 50;

public static class Counter {
    int countOfFiles;
    int countOfFolders;
}

@Benchmark
public List<File> dir_listFiles() {
    List<File> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        File dir = new File(testDir);
        files.clear();
        for (File f : dir.listFiles()) {
            if (f.isDirectory()) {
                continue;
            }
            files.add(f);
        }
    }
    return files;
}

@Benchmark
public List<Path> path_walkTree() throws Exception {
    final List<Path> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        Path dir = Paths.get(testDir);
        files.clear();
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path path, BasicFileAttributes attrs) throws IOException {
                files.add(path);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult preVisitDirectory(Path path, BasicFileAttributes attrs) throws IOException {
                return path == dir ? FileVisitResult.CONTINUE : FileVisitResult.SKIP_SUBTREE;
            }
        });
    }
    return files;
}

@Benchmark
public List<Path> path_find() throws Exception {
    final List<Path> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        Path dir = Paths.get(testDir);
        files.clear();
        files.addAll(Files.find(dir, 1, (path, attrs) -> true /* !attrs.isDirectory() */)
                          .collect(Collectors.toList()));
    }
    return files;
}
```
0
Sep 14 '16 at 17:20


