Why use hashing to create paths for large file collections?

I noticed a number of cases where an application or database stored collections of files / blobs with a to determine the path and file name. I believe the intended result is a situation where the path never gets too deep, or the folders never get too full - there are too many files (or folders) in the folder creating slow access.

EDIT: The examples are often digital libraries or repositories, although the simplest example I can imagine (which can be installed in about 30 seconds) is the Zotero / citation database document.

What for?

EDIT: thanks Mat for the answer - does this technique of using a hash to create a file path have a name? Is this a sample? I would like to read more, but did not find anything in the ACM Digital Library

+3
source share
5 answers

Hash / B: Tree

The advantage of the hash is that you look faster when you are going to use the "=" operator to search.

If you intend to use things like "<" or ">" or anything else other than "=", you need to use B: Tree because it will be able to perform such searches.

Directory structure

If you have hundreds of thousands of files to be stored in the file system, and you put them all in one directory, you will reach the point where the index index of the directory will grow so much that it will take several minutes to add / remove a file from this directory. and you can even reach the point at which the index will not fit into memory, and you will not be able to add / remove or even touch the directory.

You can be sure that for the hash method foo, foo ("something") will always return the same, say, "grbezi". Now you use part of this hash to store the file, say, in gr / be / something. The next time you need this file, you just need to calculate the hash and it will be available directly. In addition, you get the fact that with a good hash function, the distribution of hashes in the hash space is pretty good, and for a large number of files they will be evenly distributed within the hierarchy, sharing the load.

+4
source

I think we need to take a closer look at what you are trying to do. In the general case, a hash and a B-Tree abstractly provide two general operations: insert element and search element. The hash executes them, asymptotically , O (1) times, while the hash function behaves well (although in most cases it behaves very poorly to a particular workload it can be just as bad as O (n).) For tree AB for comparison, O (log n) time is required for insertion and search. Therefore, if these are the only operations that you perform, a hash table is a faster choice (and much easier than implementing tree B if you have to write it yourself.)

The kicker comes when you want to add operations. If you want to do everything that requires ordering (which means, say, reading the elements in key order), you need to do other things, the simplest thing is to copy and sort the keys, and then access the keys using this temporary table. The problem is that the temporal complexity of sorting is O (n log n), so if you need to do this very strongly, the hash table no longer has a performance advantage.

+2
source

A hash is faster to check than to cross a B-tree. Therefore, if frequent checks of existence are carried out, this method may be useful. Other than that, I really don't understand the situation, because hash tables do not preserve order or hierarchy. Therefore, maintaining the directory structure in them is not possible if the directories must go individually.

0
source

Hashes also provide a unique opportunity for a path name. Very few name clashes.

0
source

Zotero, in particular, actually uses eight-character alphanumeric unique identifiers; they are not a hash of anything related to the main file, and they actually match the attachment key in the Zotero database (also used to access the file and its metadata using the Zotero API). The key is guaranteed to be unique in the local Zotero instance (well, for libraries with items up to 2821109907457), and it is combined with the library key to make a globally unique key for embedding in the larger Zotero world. Keys are used in the file system to a large extent to circumvent name conflicts and special characters.

I understand that many of the UUIDs that you see around the library and repository world are similar in justification - they are less prone to conflict than auto-incrementing numeric identifiers, which makes many things a lot easier, but they are not, unlike the correct SHA1 hashes, used as commit identifiers in git is a must hash.

0
source

All Articles