Quick sort by date huge file ArrayList

I have an ArrayList in Java with a huge number of files (~ 40,000 files). I need to sort these files in ascending / descending order by date. I am currently using simple

Collections.sort(fileList, new FileDateComparator()); 

where is FileDateComparator

 public class FileDateComparator implements Comparator<File> { @Override public int compare(File o1, File o2) { if(o1.lastModified() < o2.lastModified()) return -1; if(o1.lastModified()==o2.lastModified()) return 0; return 1; } } 

Sorting takes too much time for me, for example, 20 seconds or more. Is there a more efficient way to implement this? I already tried using Apache I / O LastModifiedFileComparator as a comparator, but it seems to be implemented in the same way since it takes the same time.

+6
source share
4 answers

I think you need to cache the modification time in order to speed this up. You can, for example, try something like this:

 class DatedFile { File f; long moddate; public DatedFile(File f, long moddate) { this.f = f; this.moddate = moddate; } }; ArrayList<DatedFile> datedFiles = new ArrayList<DatedFile>(); for (File f: fileList) { datedFiles.add(new DatedFile(f, f.lastModified())); } Collections.sort(fileList, new FileDateComparator()); ArrayList<File> sortedFiles = new ArrayList<File>(); for (DatedFile f: datedFiles) { sortedFiles.add(ff); } 

(with appropriate implementation of FileDateComparator)

+4
source

Sorting is O (n lg N), so your list of 40,000 files will need about 600,000 operations (comparisons). If it takes about 20 seconds, that's about 30,000 comparisons per second. Therefore, each comparison takes about 100,000 measures. This is not possible due to processor processing. Sorting is almost certainly related to I / O binding, not CPU binding. Disk queries are especially expensive.

You may be able to reduce time by using multithreading to reduce the impact of disk accesses. That is, having several reads in the queue and expecting the drive to provide its data. To do this, use a (parallel) card that displays the file names during modification and populates this card with multiple threads. Then try using the sort method rather than using File.lastModified() .

Even if you populate this map with only one thread, you will get little benefit, because your sorting method will use local caching time, rather than request O / S each time for modification time. The advantage of such caching can be small, since O / S itself can cache this information.

+2
source

The Java array .sort () is (roughly from Java 6) actually TimSort [ http://svn.python.org/projects/python/trunk/Objects/listsort.txt ], the fastest assignment is #sort out there (much better than qsort in many situations); You cannot sort anything noticeably faster without heuristics.

"like 20 seconds or more, it means that your problem is probably known ApplicationProfilingSkippedByDeveloperException - performs profiling and determines the exact bottleneck. I would go with the OS I / O files as one; doing my own request for file attributes in batch mode, caching the results, and then processing them right away, seems like the only reasonable solution here.

+1
source

You need to cache lastModified (). One way to do this is through the comparator itself.

 public class FileDateComparator implements Comparator<File> { Map<File, Long> lastModifiedMap = new HashMap<>(); Long lastModified(File f) { Long ts = lastModifiedMap.get(f); if (ts == null) lastModifiedMap.put(f, ts = f.lastModified()); return ts; } @Override public int compare(File f1, File f2) { return lastModified(f1).compareTo(lastModified(f2)); } } 

This will improve performance by only viewing the modified date of each file once.

0
source

All Articles