How to convert a directory path to a unique numeric identifier (Linux / C ++)?

I am exploring ways to take a directory (folder) and get some form of unique numeric identifier. I researched the string to hash methods, however the Pigeon Hole Principle means that it is never possible to get a truly unique number for each individual line.

A string for a unique hash is not good.

I recently studied other ways to achieve my goal and therefore asked the following question:

Directory timestamps - how unique are they? What resolution does the timestamp reported by "stat" refer to as described here (second message)? if the resolution is small enough, is it possible for more than one folder to use the exact time stamp on a Linux system?

If anyone has other methods / methods that they would like to share, I would be happy to listen :)

Change 1 . To clarify my use case in response to the answers posted so far: I work on Android platforms, so the file system is not connected to any other (except, of course, for removable media such as Micro SD cards).

I insert every path into the database, but try to avoid string comparison when querying the table. Using maps / hash maps is not an option here. Yes, the path itself is unique, but ideally I need a numeric identifier that can be used to query the table, unlike the path itself. The identifier must also be unique for each path. I experimented with std :: collate, but found that there were a lot of collisions in the hashes (a data set of 20,000 paths that approach 100 collisions). What is even more surprising is that the hashes turned out to be significantly different each time the application was launched. I wonder what it somehow sowed?

Thanks a lot, P

+4
source share
3 answers

On any UNIX system, you can use the inode number as a unique identifier on this file system. Combining it with the device number will make it unique in the car. If you want it to be unique globally, you can specify the primary MAC address of the system.

Keep in mind that:

  • The inode number will β€œfollow” the directory if it is moved or renamed. It will change if the directory is deleted and replaced.

  • The inode number will not be stable on different systems except for one or two truly special directories. (For example, / usually inode 2.)

+5
source

+1 duskwuff, good one!

Another way is to simply treat the dir path as a number ("BigInt").

Take this directory, for example: /opt/www/log
These are 12 characters.
12 * 8 bits = 96 bits.
Thus, you have a 96-bit long number that you can represent in hex / base64 / anything (in case you need to pass it as an HTML link).

I personally would go for the duskwuff approach, though.

+1
source

I think this very much depends on why you need a unique digital identifier. Timestamps can change, inodes can change, changes can change, MAC addresses can change. (However, +1 for duskwuff)

In some scenarios, you can simply create a table where each path you add gets a new unique number, just like the numeric key columns in a database.

Although hashes can collide, in every real environment this is absolutely unlikely (unless you use the most deceitful algorithm around ...) Most likely, you get errors due to flaws in your implementation, for example, you handle "/ tmp" differently than "/ tmp /" because you do not normalize the paths before hashing them. Or you want to distinguish physical folders, but forget to check the links to hard links and symbolic links to the same folder, this way you will get several hashes / id for the same directory.

Again, depending on your usecase, the collision is not necessarily fatal: if you find that the new path leads to the same hash as the existing one (will not happen!), You can still respond to this case. (*)

Just to help your imagination: if you use a 64-bit hash, you can fill up 150,000,000 1 TB hard drives with empty folders (nothing but short folder names), and then you will have a collision for sure. If you think that it is too risky (blinking, blinking), use a 128-bit hash, which makes it 18,446,744,073,710,000 times less likely.

Hashes are designed to make collisions unlikely, and even the good old MD5 will do its job well if no one willingly tries to create a collision.

(*) Edit: In this article, you see that a collision means that the search is no longer O (1), but a bit slower. Since this rarely happens, you can easily live with it. If you use std :: map (no hash) or std :: hashmap, you do not have to worry about collisions. See what is the difference between a map and a hashmap in STL .

0
source

All Articles