Efficient way to iterate over a list of files

I am looking for an efficient way to iterate over thousands of files in one or more directories.

The only way to iterate over the files in a directory seems to be the File.list*() methods. These methods eagerly load the entire list of files into a collection before letting you iterate over it, which seems impractical in terms of time and memory consumption. I looked at Apache Commons and other similar tools, but they all end up calling File.list*() somewhere inside. JDK 7's walkFileTree() came close, but I cannot control when to advance to the next item.

I have more than 150,000 files in the directory, and after much trial and error with -Xms / -Xmx I got rid of the out-of-memory errors. But the time required to fill the array has not changed.

I want to make an Iterable class that uses the opendir()/closedir() approach to lazily load file names as needed. Is there any way to do this?

Update:

Java 7 NIO.2 supports lazy file iteration through java.nio.file.DirectoryStream, which is an Iterable. For JDK 6 and below, the only option is the File.list*() methods.
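To illustrate, here is a minimal sketch of lazy iteration with DirectoryStream; the directory path (".") is just a placeholder for this example:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DirectoryStreamDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("."); // placeholder directory
        // DirectoryStream reads entries lazily from the OS as you iterate;
        // it never builds the whole listing in memory up front.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                System.out.println(entry.getFileName());
            }
        }
    }
}
```

The try-with-resources block matters: the stream holds an open directory handle (much like opendir()), and closing it releases that handle (like closedir()).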

4 answers

Here is an example of how to iterate over directory entries without having to store all 150,000 of them in an array. Add error/exception/shutdown/timeout handling as needed. This approach uses a background thread to feed a small blocking queue.

Using:

    FileWalker z = new FileWalker(new File("\\"), 1024); // start path, queue size
    Iterator<Path> i = z.iterator();
    while (i.hasNext()) {
        Path p = i.next();
    }

Example:

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.FileVisitor;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.util.Iterator;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class FileWalker implements Iterator<Path> {
        final BlockingQueue<Path> bq;

        FileWalker(final File fileStart, final int size) throws Exception {
            bq = new ArrayBlockingQueue<Path>(size);
            Thread thread = new Thread(new Runnable() {
                public void run() {
                    try {
                        Files.walkFileTree(fileStart.toPath(), new FileVisitor<Path>() {
                            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                                return FileVisitResult.CONTINUE;
                            }
                            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                                try {
                                    // Blocks the walker when the queue is full,
                                    // so at most "size" entries are ever buffered.
                                    bq.offer(file, 4242, TimeUnit.HOURS);
                                } catch (InterruptedException e) {
                                    e.printStackTrace();
                                }
                                return FileVisitResult.CONTINUE;
                            }
                            public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
                                return FileVisitResult.CONTINUE;
                            }
                            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                                return FileVisitResult.CONTINUE;
                            }
                        });
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            });
            thread.setDaemon(true);
            thread.start();
            thread.join(200);
        }

        public Iterator<Path> iterator() {
            return this;
        }

        public boolean hasNext() {
            boolean hasNext = false;
            long dropDeadMS = System.currentTimeMillis() + 2000;
            while (System.currentTimeMillis() < dropDeadMS) {
                if (bq.peek() != null) {
                    hasNext = true;
                    break;
                }
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            return hasNext;
        }

        public Path next() {
            Path path = null;
            try {
                path = bq.take();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            return path;
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

This seems impractical in terms of time / memory consumption.

Even 150,000 files will not consume an impractical amount of memory.

I want to make some Iterable class that uses opendir () / closedir () functions to lazily load file names as needed. Is there any way to do this?

You would need to write, or find, a native (JNI) library to call those C functions. That will probably cause more problems than it solves. My advice: just use File.list() and increase the heap size.


Actually, there is another hacky alternative. Use Runtime.exec() to run ls (or the Windows equivalent) and write your iterator to read and parse the command's output as it streams in. This avoids the nastiness of calling native libraries from Java.
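That exec-based idea could be sketched roughly as follows. This assumes a Unix-like system where "ls -1" prints one name per line; the class name and structure here are illustrative, not a definitive implementation:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ExecLister {
    // Streams file names from an external "ls" process instead of
    // loading the whole listing into a Java array at once.
    public static void list(String dir) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("ls", "-1", dir);
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                // Each name is handled as it arrives; nothing is buffered
                // beyond the pipe's own buffer.
                System.out.println(line);
            }
        }
        p.waitFor();
    }
}
```

Wrapping that read loop in an Iterator is straightforward, but note you inherit the external command's quirks (locale, special characters in names, error output).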


Can you group your loads by file type to narrow down batches?
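For example, a listing could be narrowed by extension with a FilenameFilter, so each batch is smaller; this is a minimal sketch (the class and method names are illustrative):

```java
import java.io.File;
import java.io.FilenameFilter;

public class BatchByType {
    // Returns only the names ending with the given extension,
    // shrinking each batch to one file type.
    static String[] listByExtension(File dir, final String ext) {
        return dir.list(new FilenameFilter() {
            public boolean accept(File d, String name) {
                return name.endsWith(ext);
            }
        });
    }
}
```

Note the filter is applied per entry as the directory is read, but File.list() still materializes the full array of matching names, so this reduces batch size rather than making iteration lazy.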


I'm just wondering: why would the plain File.list() method, which returns a String[] of file names (rather than File.listFiles()), consume that much memory? It is a native call that simply returns the names of the files. Perhaps you can iterate over that array and lazily load any file you actually need.

