Handling many small temporary files

I have a web server that saves cache files and keeps them for 7 days. The file names are md5 hashes, i.e. exactly 32 hexadecimal characters, and they are stored in a tree structure that looks like this:

    00/
      00/
        00000ae9355e59a3d8a314a5470753d8
        .
        .
      01/
        .
        .

You get the idea.

My problem is that deleting old files takes a lot of time. I have a daily cron job doing

    find cache/ -mtime +7 -type f -delete

which takes more than half a day. I worry about scalability and the impact this has on server performance. On top of that, the cache directory has become a black hole in my system, trapping the occasional innocent du or find.

The standard solution to LRU caching is some sort of heap. Is there a way to scale that to the filesystem level? Is there some other way to implement this that makes it easier to manage?

Here are the ideas I reviewed:

  • Create 7 top-level directories, one for each day of the week, and empty one of them every day. This multiplies the lookup time for a cache file by 7, makes things really complicated when a file is overwritten, and I'm not sure what it would do to the deletion time. (A rough lookup sketch follows this list.)
  • Save the files as blobs in a MySQL table with indexes on name and date. This seemed promising, but in practice it was always much slower than the FS. Maybe I'm doing it wrong.
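
To make the 7× lookup cost of the first idea concrete, here is a minimal sketch of a read under that scheme, assuming the per-weekday roots are simply named cache/1 through cache/7 (those names are hypothetical):

    #!/bin/bash
    # Hypothetical lookup under the per-weekday scheme: the file could sit under
    # any of the 7 roots, so the worst case probes all of them.
    hash=00000ae9355e59a3d8a314a5470753d8
    for day in 1 2 3 4 5 6 7; do
        f="cache/$day/${hash:0:2}/${hash:2:2}/$hash"
        if [ -f "$f" ]; then
            cat "$f"
            break
        fi
    done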

Any ideas?

+7
linux filesystems caching
5 answers

When you store a file, also create a symbolic link to it in a second directory structure that is organized by date rather than by name.

Retrieve your files using the "name" structure, and delete them using the "date" structure.
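
A minimal sketch of the store-time half of this, assuming a by-name tree next to a by-date tree under the same cache root (the by-name/by-date directory names are placeholders):

    #!/bin/bash
    # Sketch: write the cache file into the name-based tree, then link it into today's date bucket.
    hash=00000ae9355e59a3d8a314a5470753d8
    name_path="cache/by-name/${hash:0:2}/${hash:2:2}/$hash"
    date_dir="cache/by-date/$(date +%F)"

    mkdir -p "$(dirname "$name_path")" "$date_dir"
    printf 'cached response body' > "$name_path"             # stand-in for the real cache write
    ln -sf "$(readlink -f "$name_path")" "$date_dir/$hash"   # absolute target so the link stays valid

The deletion half (walk the expired date bucket, remove each link's target and the link itself, then the bucket) is sketched under the "ToDelete" answer further down, since the two suggestions share it.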

+15

Assuming this is ext2/3, have you tried enabling indexed directories? When you have a large number of files in any particular directory, lookups become very slow, and so does deleting anything.

Use tune2fs -O dir_index to enable the dir_index option.

When mounting the file system, be sure to use the noatime option, which stops the OS from updating the access-time information for these directories on every read (they still get updated when their contents change).

Looking at the original post, it seems you only have 2 levels of indirection to the files, which means you can have a huge number of files in the leaf directories. Once they contain more than a million entries, you will find that lookups and changes are very slow. An alternative is to use a deeper directory hierarchy, reducing the number of items in any particular directory and therefore the cost of lookups and updates in that directory.
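
A sketch of the two filesystem tweaks, assuming the cache lives on /dev/sdb1 mounted at /var/cache/www (both names are placeholders; run against an unmounted filesystem and keep a backup):

    #!/bin/bash
    # Enable hashed (b-tree) directory indexes on the ext2/3 filesystem.
    tune2fs -O dir_index /dev/sdb1
    # Existing directories only benefit after being re-indexed; -D does that (unmount first).
    e2fsck -fD /dev/sdb1

    # Remount with noatime so plain reads stop triggering inode writes.
    mount -o remount,noatime /var/cache/www
    # Or make it permanent in /etc/fstab:
    # /dev/sdb1  /var/cache/www  ext3  defaults,noatime  0  2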

+4

How about this:

  • Have another folder called, say, "ToDelete"
  • When you add a new item, get today's date and look for a subfolder of "ToDelete" named after the current date
  • If it isn't there, create it
  • Add a symbolic link in that day's folder to the item you just created
  • Create a cron job that goes to the folder in "ToDelete" whose date has expired and deletes all of the linked files (a cron sketch follows this list)
  • Then delete the folder that contained all the links
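
A minimal cron sketch of the deletion step, assuming the dated link folders live under cache/ToDelete/YYYY-MM-DD and the 7-day retention from the question (GNU date assumed; names are placeholders):

    #!/bin/bash
    # Nightly job: remove every linked file in the expired day's folder, then the folder itself.
    shopt -s nullglob                                    # empty folder -> loop body never runs
    old_dir="cache/ToDelete/$(date -d '8 days ago' +%F)"
    if [ -d "$old_dir" ]; then
        for link in "$old_dir"/*; do
            target="$(readlink -f "$link")"
            [ -f "$target" ] && rm -f "$target"          # the real cache file
            rm -f "$link"                                # the symlink pointing at it
        done
        rmdir "$old_dir"
    fi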
+1

How about having a table in your database that uses the hash as the key? The other field would then be the name of the file. That way the files can be stored in a date-oriented layout for fast deletion, and the database can be used to find each file's location from its hash quickly.
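
A rough sketch of such an index table, assuming a MySQL client with access to a database called cachedb and an on-disk layout of date/hash (every name here is hypothetical):

    #!/bin/bash
    # Hypothetical index table: hash -> dated path, so lookups stay fast and expiry is date-driven.
    mysql --database=cachedb -e "
      CREATE TABLE IF NOT EXISTS cache_index (
        hash    CHAR(32)     NOT NULL PRIMARY KEY,
        path    VARCHAR(255) NOT NULL,   -- e.g. 2009-03-14/00000ae9355e59a3d8a314a5470753d8
        created DATE         NOT NULL,
        KEY idx_created (created)
      )"

    # Fast location lookup by hash:
    mysql -N --database=cachedb -e "SELECT path FROM cache_index
                                    WHERE hash = '00000ae9355e59a3d8a314a5470753d8'"

    # After the daily directory removal, drop the matching rows:
    mysql --database=cachedb -e "DELETE FROM cache_index WHERE created < CURDATE() - INTERVAL 7 DAY"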

0
