Building a directory tree from a list of file paths

I am looking for a time-efficient way to build a directory tree from a list of file paths. There may be hundreds of millions of paths.

A brute-force solution would split each path on the directory separator and walk the tree from the root, adding a directory or file entry at each level and doing string comparisons along the way, but this will be exceptionally slow.
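For concreteness, here is a minimal sketch of that brute-force insert (Python, with nested dicts as tree nodes purely for illustration; none of these names come from real code):

    # Brute-force insert: split on '\' and walk from the root, creating
    # missing nodes as we go. Every level costs a dictionary lookup, i.e.
    # hashing/comparing the component string.
    def insert_path(root, path):
        node = root
        for part in path.split("\\"):
            if part:                          # skip empty components
                node = node.setdefault(part, {})

    root = {}
    insert_path(root, r"C:\Users\Aaron\AppData\Amarok\AFile")

Every component of every path goes through a fresh lookup, even though consecutive paths share almost all of their components.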

Inputs are usually sorted alphabetically, so the list will look like this:

    C:\Users\Aaron\AppData\Amarok\AFile
    C:\Users\Aaron\AppData\Amarok\Afile2
    C:\Users\Aaron\AppData\Amarok\Afile3
    C:\Users\Aaron\AppData\Blender\alibrary.dll
    C:\Users\Aaron\AppData\Blender\and_so_on.txt

Given that ordering, my instinct is to somehow group the paths by their shared directory prefixes before falling back to slow string comparisons, but I'm really not sure how. I would be grateful for any ideas.

Edit: Ideally this tree would be lazily loaded from the top down, if possible.

3 answers

You have no choice but to do full string comparisons, since you cannot know in advance where two strings may differ. There are a few tricks that can speed things up a bit:

  • As David said, build a tree, but find the new insertion point starting from the previous one (perhaps using a matchingPrefix routine that tells you where the new path first differs); see the sketch below this list.
  • Use a hash table at each level of the tree if a directory can contain a large number of entries and you need to watch for duplicates. (Otherwise, simply appending to a list is fine.)
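Here is a rough sketch of that combination, assuming sorted input and using Python dicts as the tree nodes (the dicts double as the per-level hash tables; matching_prefix and the other names are only illustrative):

    def matching_prefix(a, b):
        """Number of leading path components the two lists share."""
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def build_tree(sorted_paths):
        root = {}
        prev_parts, prev_nodes = [], []       # components and nodes of the last path
        for path in sorted_paths:
            parts = [p for p in path.split("\\") if p]
            keep = matching_prefix(parts, prev_parts)
            node = prev_nodes[keep - 1] if keep else root
            prev_nodes = prev_nodes[:keep]
            for part in parts[keep:]:         # only the differing suffix is walked
                node = node.setdefault(part, {})
                prev_nodes.append(node)
            prev_parts = parts
        return root

Since consecutive sorted paths usually share most of their components, the inner loop typically touches only the last level or two.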

If the files actually exist on disk, you can generate the tree structure with the tree command.
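For example, on Windows the built-in tree command can dump a directory tree including files (/F) using plain ASCII characters (/A):

    tree "C:\Users\Aaron" /F /A > tree.txt

You would still need to parse that text back into your own structure, so this mainly helps if a human-readable listing is all you need.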


To take advantage of the "usually sorted" property of your input, start the walk at the directory where the last file was inserted: compare the directory part of the current path with the previous one. If they match, you can insert right there; otherwise pop up one level and try again.
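A small sketch of that shortcut, again with plain Python dicts as nodes (the names are illustrative, and the "pop up a level and retry" refinement is simplified here to a full re-walk whenever the directory changes):

    def build_tree_cached(paths):
        root = {}
        last_dir, last_node = None, None
        for path in paths:
            directory, _, name = path.rpartition("\\")
            if directory != last_dir:
                # Directory changed: walk down from the root again.
                node = root
                for part in directory.split("\\"):
                    if part:
                        node = node.setdefault(part, {})
                last_dir, last_node = directory, node
            last_node.setdefault(name, {})    # add the file entry
        return root

When consecutive paths share the same directory, which is the common case for sorted input, no tree walk happens at all.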

