Millions of small image files: how to overcome slow file-system access on Windows XP

I'm generating millions of small image tiles that will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application writes all of its output files into a single folder, and in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP with NTFS; the disk is 500 GB and was formatted with the default operating-system options.

I'm finding that rendering the tiles gets slower and slower as the number of tiles produced increases. I've also noticed that if I try to browse the folder in Windows Explorer or list it from the command line, the whole machine effectively locks up for several minutes before it recovers enough to do anything again.

I've already split the input shapefiles into smaller pieces, spread the work across different machines and so on, but the problem is still causing me significant pain. I've been wondering whether the cluster size on my disk has anything to do with it, or whether I should use a different file system altogether. Does anyone have any ideas how I can get around this problem?

Thanks,

Barry.

Update:

Thanks everyone for the suggestions. The solution was to write a piece of code that monitors the GMapCreator output folder and moves files into a directory hierarchy based on their file names, so a file named abcdefg.gif gets moved to \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator got around the file-system performance problem. The tip about DOS 8.3 file-name creation was also very useful; as noted below, I was amazed at the difference it made. A simplified sketch of the mover follows. Cheers :-)
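Very roughly, the mover was equivalent to the following (a simplified Python sketch of the idea, not the actual code; the folder paths and polling interval are placeholders):

import os
import shutil
import time

SOURCE = r"C:\gmapcreator_output"   # placeholder for the GMapCreator output folder
TARGET = r"C:\tiles"                # placeholder for the root of the per-character tree

def nested_path(filename):
    # abcdefg.gif -> TARGET\a\b\c\d\e\f\g.gif (one directory level per character)
    stem, ext = os.path.splitext(filename)
    if not stem:
        return None
    return os.path.join(TARGET, *stem[:-1], stem[-1] + ext)

def sweep_once():
    for name in os.listdir(SOURCE):
        src = os.path.join(SOURCE, name)
        dst = nested_path(name)
        if dst is None or not os.path.isfile(src):
            continue
        # a real version should also skip files GMapCreator is still writing
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.move(src, dst)

if __name__ == "__main__":
    while True:              # poll the output folder while GMapCreator runs
        sweep_once()
        time.sleep(5)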

+4
5 answers

There are a few things you could/should do:

  • Disable automatic creation of short (8.3) NTFS file names (google it).
  • Or restrict your file names to the 8.3 pattern themselves (e.g. i0000001.jpg, ...).

  • In any case, try to make the first six characters of each file name as unique/different as possible.

  • If you keep reusing the same folder over time (i.e. you add files, delete files, read files, ...):

    • Use contig to keep the directory's index file as unfragmented as possible (see this for an explanation).
    • Especially after deleting many files, consider the folder delete trick (delete and recreate the folder) to reduce the size of the directory index file, which never shrinks on its own.
  • As already posted, consider splitting the files across several directories.

e.g. instead of

directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg

use

directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
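A minimal sketch of that layout in Python, assuming the 2nd and 3rd characters of each name are used as subdirectory names as in the example above (the function and its arguments are illustrative only):

import os
import shutil

def spread_by_chars(src_dir, dst_dir):
    # move directory/abc.jpg to directory/b/c/abc.jpg, as in the example above
    for name in os.listdir(src_dir):
        stem = os.path.splitext(name)[0]
        if len(stem) < 3:
            continue                                     # scheme assumes names of at least 3 characters
        sub = os.path.join(dst_dir, stem[1], stem[2])    # 2nd and 3rd characters become folders
        os.makedirs(sub, exist_ok=True)
        shutil.move(os.path.join(src_dir, name), os.path.join(sub, name))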
+4

Use more folders and limit the number of entries in any one folder. The time it takes to enumerate the entries in a directory grows with the number of entries (exponentially? I'm not sure), and if you have millions of small files in one directory, even something like dir folder_with_millions_of_files can take several minutes. Switching to another file system or OS will not solve the problem; Linux showed the same behaviour the last time I checked.

Find a way to group images into subfolders of not more than a few hundred files each. Make the directory tree as deep as necessary to support this.
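One possible way to do that grouping, sketched in Python with a hash-based bucket scheme (the bucket count and naming are my own illustrative choices, not part of this answer):

import os
import shutil
import zlib

N_BUCKETS = 16384   # e.g. 4.2 million tiles / 16384 buckets is roughly 250 files per folder

def bucket_for(name):
    # pick a stable subfolder from a hash of the file name, so lookups stay deterministic
    return "%04x" % (zlib.crc32(name.encode("utf-8")) % N_BUCKETS)

def distribute(src_dir, dst_root):
    for name in os.listdir(src_dir):
        dst = os.path.join(dst_root, bucket_for(name))
        os.makedirs(dst, exist_ok=True)
        shutil.move(os.path.join(src_dir, name), os.path.join(dst, name))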

+1

The solution is probably to limit the number of files per directory.

I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their names, e.g.

 gbp97m.xls 

stored in

 g/b/p97m.xls 

This works well provided your files are appropriately named (we had a reasonable spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to bring each directory down to around 100 files and to relieve the disk bottleneck.
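For illustration, a tiny Python sketch of this naming scheme, assuming the first two characters of a name become directory levels as in the gbp97m.xls example (the function name is made up):

import os

def split_path(root, filename):
    # gbp97m.xls -> root/g/b/p97m.xls: the first two characters become directories
    return os.path.join(root, filename[0], filename[1], filename[2:])

# split_path("data", "gbp97m.xls") gives "data/g/b/p97m.xls" (backslashes on Windows)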

0

One solution is to implement haystacks. This is what Facebook does for photos, since the metadata lookups and random reads required to fetch a file are quite expensive and offer no value for a data store like this.

Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images into a single haystack store file. This keeps the metadata overhead very small and makes it possible to hold each needle's location within the store file in an in-memory index, so an image's data can be retrieved in a minimal number of I/O operations, with all unnecessary metadata overhead eliminated.
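As a toy illustration only (not Facebook's actual format or API), a haystack-style store in Python might look like this: append every image to one big file and keep an in-memory index of offsets, so fetching a photo is a single seek and read:

import os

class TinyHaystack:
    # toy needle-in-a-haystack store: one big file plus an in-memory offset index

    def __init__(self, path):
        self.path = path
        self.index = {}                    # name -> (offset, length)
        open(path, "ab").close()           # make sure the store file exists

    def put(self, name, data):
        with open(self.path, "ab") as f:
            offset = f.tell()              # appends always land at the end of the file
            f.write(data)
        self.index[name] = (offset, len(data))

    def get(self, name):
        offset, length = self.index[name]
        with open(self.path, "rb") as f:   # one seek plus one read per photo
            f.seek(offset)
            return f.read(length)

# store = TinyHaystack("photos.hay")
# store.put("abcdefg.gif", open("abcdefg.gif", "rb").read())
# data = store.get("abcdefg.gif")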

0
