Finding the optimal file combination

This seems like a problem for which an algorithm should already exist, but I can't find the right search terms for Google, it seems :).

Problem: I would like to write a small program in which I choose a directory containing arbitrary files (for my purposes, media files: audio and video). Then I enter the maximum total file size, in MB, that must not be exceeded, and click a "Calculate best fit" button.

This button should compare all the files in the directory and produce a list of files whose combined size is as close as possible to the given maximum, without exceeding it.

This way, you can find out which files to combine when burning a CD or DVD so that you can use as much of the disc as possible.

I tried to come up with an algorithm for this myself, but failed :(.

Does anyone know of any good algorithm for this?

Thanks in advance:)

+7
language-agnostic algorithm file-io
6 answers

This, as stated, is the knapsack problem, which is a combinatorial optimization problem: you are looking for the subset (or permutation) of a set that minimizes (or maximizes) a certain cost. Another well-known problem of this kind is the travelling salesman problem.

Such problems are usually very hard to solve exactly. But if you are happy with near-optimal solutions, you can use non-deterministic algorithms such as simulated annealing. You will most likely not get the optimal solution, but one close to it.

This link explains how simulated annealing can solve the knapsack problem, so it should be of interest to you.
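To make the suggestion concrete, here is a minimal, hypothetical sketch of simulated annealing applied to the question's problem (treating a file's "value" as equal to its size). The function name, the cooling schedule and the parameters are my own illustrative assumptions, not part of the linked material, and a real implementation would need tuning.

    # Hypothetical sketch: simulated annealing for "fill the disc as full as
    # possible without exceeding the limit". Sizes and the limit are in MB;
    # the parameters below are illustrative guesses, not tuned values.
    import math
    import random

    def anneal(sizes, limit, steps=200000, t_start=500.0, t_end=0.01):
        def total(selection):
            return sum(sizes[i] for i in selection)

        current = set()                       # start with nothing selected
        best = set()
        cooling = (t_end / t_start) ** (1.0 / steps)
        temp = t_start

        for _ in range(steps):
            i = random.randrange(len(sizes))  # neighbour: toggle one random file
            candidate = set(current)
            if i in candidate:
                candidate.remove(i)
            else:
                candidate.add(i)

            if total(candidate) <= limit:     # never accept an infeasible state
                delta = total(candidate) - total(current)
                # Always accept improvements; accept worse states with a
                # probability that shrinks as the temperature falls.
                if delta >= 0 or random.random() < math.exp(delta / temp):
                    current = candidate
                    if total(current) > total(best):
                        best = set(current)
            temp *= cooling

        return best

    if __name__ == '__main__':
        random.seed(1)
        sizes = [random.randint(4, 120) for _ in range(100)]  # pretend MB sizes
        chosen = anneal(sizes, 700)                           # e.g. a 700 MB CD
        print('%d of 700 MB used' % sum(sizes[i] for i in chosen))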

+5

Just for fun, I tried out the exact dynamic programming solution. It is written in Python, because of my firm belief that you should not optimize until you need to ;-)

This can provide either a start or a rough idea of how close you can get before resorting to approximation.

The code is based on http://en.wikipedia.org/wiki/Knapsack_problem#0-1_knapsack_problem , hence the less-than-informative variable names m, W, w, v.

    #!/usr/bin/python
    import sys

    solcount = 0          # counts how many Solution objects get built

    class Solution(object):
        def __init__(self, items):
            object.__init__(self)
            #self.items = items
            self.value = sum(items)
            global solcount
            solcount += 1
        def __str__(self):
            #return str(self.items) + ' = ' + str(self.value)
            return ' = ' + str(self.value)

    m = {}                # memo table: (number of items left, capacity) -> Solution

    def compute(v, w):
        coord = (len(v), w)
        if coord in m:
            return m[coord]
        if len(v) == 0 or w == 0:
            m[coord] = Solution([])
            return m[coord]
        newvalue = v[0]
        newarray = v[1:]
        notused = compute(newarray, w)
        if newvalue > w:
            m[coord] = notused
            return notused
        #used = Solution(compute(newarray, w - newvalue).items + [newvalue])
        used = Solution([compute(newarray, w - newvalue).value] + [newvalue])
        best = notused if notused.value >= used.value else used
        m[coord] = best
        return best

    def main():
        v = [int(l) for l in open('filesizes.txt')]   # one size (in blocks) per line
        W = int(sys.argv[1])                          # capacity, in blocks
        print len(v), "items, limit is", W
        print compute(v, W)
        print solcount, "solutions computed"

    if __name__ == '__main__':
        main()

For simplicity, I only deal with the file sizes: once you have the list of sizes you want to use, you can find file names with those sizes by searching through the directory listing, so there is no point dragging file names through the core, slow part of the program. I also express everything in multiples of the block size.

As you can see, I have commented out the code that produces the actual solution (as opposed to the value of the solution). That was to save memory: the proper way to store the list of files used is not one list per solution, but to have each solution point back to the solution it was derived from. You can then work out the list of files at the end by walking back through the chain, outputting the difference between the values at each step.

For a list of 100 randomly generated file sizes in the range 2000-6000 (I am assuming 2k blocks, so these are files of 4-12 MB), this solves for W = 40K in 100 seconds on my laptop. In doing so it computes 2.6M of the 4M possible solutions.
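For anyone wanting to reproduce this kind of test, a throwaway script along the following lines could generate the input file; the file name filesizes.txt comes from the code above, while the script itself (and the solver file name in the comment) is my own assumption.

    # Hypothetical helper: write 100 random sizes (in 2k blocks) to filesizes.txt.
    # The solver above could then be run as, say:  python knapsack.py 40000
    import random

    random.seed(0)                                  # reproducible test data
    with open('filesizes.txt', 'w') as f:
        for _ in range(100):
            f.write('%d\n' % random.randint(2000, 6000))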

The complexity is O(W*n), where n is the number of files. This does not contradict the fact that the problem is NP-complete, because W is not polynomial in the length of the input: the running time is pseudo-polynomial. So at least I am arriving at a solution, and that is in non-optimized Python.

Obviously some optimization is now required, because really it needs to be solved for W = 4M (an 8 GB DVD) and however many files you have (say a few thousand). Assuming the program is allowed to take 15 minutes (comparable to the time it takes to burn a DVD), that means we are currently short on performance by a factor of roughly 10^3. So we have a problem that is actually pretty hard to solve quickly and exactly on a PC, but not beyond the bounds of technology.

The main concern is memory use, because once we start hitting swap we will slow right down, and if we run out of virtual address space we have real trouble, because then we would have to implement our own storage of solutions on disk. My test run peaks at 600 MB. If you wrote the code in C on a 32-bit machine, each "solution" needs only a fixed 8 bytes. You could then build a massive 2-dimensional array of them, avoiding any memory allocation in the loop, but in 2 GB of RAM you could still only handle W = 4M together with n = 67. No good for DVDs. It could just about cope with a CD in 2k blocks, though: W = 350k gives n = 766.

Edit: MAK's point about computing iteratively bottom-up, rather than recursively top-down, should significantly reduce the memory requirement. First calculate m(1, w) for all 0 <= w <= W. From this array you can calculate m(2, w) for all 0 <= w <= W. Then you can throw away all the m(1, w) values: you will not need them to calculate m(3, w), and so on.
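A minimal sketch of that bottom-up idea, keeping only a single row of the table at any time (my illustration of the suggestion, not code from the answer; it returns only the best achievable value, not the list of files):

    # Hypothetical sketch of the bottom-up DP with a rolling array.
    # best[w] = largest total size achievable with the files seen so far,
    #           using at most w blocks. Only the current row is kept.
    def best_fill(sizes, W):
        best = [0] * (W + 1)
        for size in sizes:
            # Go from high w to low w so each file is used at most once.
            for w in range(W, size - 1, -1):
                candidate = best[w - size] + size
                if candidate > best[w]:
                    best[w] = candidate
        return best[W]

    if __name__ == '__main__':
        print(best_fill([2500, 3100, 4000, 5900], 10000))   # -> 9900 (made-up data)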

Incidentally, I suspect that the problem you actually want to solve may be the bin packing problem, rather than just the question of how to fill one DVD as fully as possible. That is the case if you have a pile of files and want to burn all of them to DVD, using as few DVDs as possible. There are situations where the bin packing problem is very easy to solve, while this problem is hard. For example, suppose you have 8 GB disks and 15 GB of small files. It will take some searching to find the closest possible match to 8 GB, but the bin packing problem would be trivially solved by just putting roughly half the files on each disk: it does not matter exactly how you split them, because you are going to waste 1 GB of space whatever you do.

All that said, there are very fast heuristics that give decent results much of the time. The simplest is to go through the list of files (perhaps in decreasing order of size) and include each file if it fits, and skip it otherwise. You only need to fall back on anything slow if the fast approximate solution is not "good enough", for your chosen value of "enough".
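A sketch of that greedy heuristic, under the same assumptions as the code above (sizes expressed in blocks; the function name is my own):

    # Hypothetical sketch of the greedy heuristic described above:
    # take files largest-first and include each one that still fits.
    def greedy_fill(sizes, W):
        chosen = []
        used = 0
        for size in sorted(sizes, reverse=True):
            if used + size <= W:
                chosen.append(size)
                used += size
        return chosen, used

    if __name__ == '__main__':
        chosen, used = greedy_fill([2500, 3100, 4000, 5900], 10000)  # made-up data
        print('%d blocks used: %s' % (used, chosen))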

+4

It sounds like you have a hard problem there. This problem is well known, and efficient solutions (probably) do not exist.

+2

Other than the obvious approach of trying all permutations of objects with size < bucket, you could also have a look at the implementation of the bucketizer Perl module, which does exactly what you are asking for. I am not sure exactly what it does, but the manual mentions that there is one "brute force" way, so I assume there must be some kind of optimization as well.

0

Thank you for your responses.

I studied this question further with the guidance of these answers. Among other things, I found this web page: http://www.mathmaniacs.org/lessons/C-subsetsum/index.html . It talks about the subset sum problem, which I believe is the problem I described here.

One of the suggestions on the web page:

-

You can point out that a number like 2^300 is so large that even a computer counting at a speed of over a million or a billion per second will not reach 2^300 until our sun burns out.

-

Personally, I would get more benefit from this algorithm when comparing a larger number of files than, say, 10 or fewer, since with a small number of files it is fairly easy to get close to the maximum just by manual trial and error.

A CD of mp3s can easily hold 100 files, and a DVD many more, which means the sun would burn out before I got my answer :).

Randomly trying combinations might seem to get you pretty close, but it can never be guaranteed to be the optimal answer, and in a bad case it can be quite far off. Brute force is the only real way to get the best answer, and that takes too much time.

So I think I will just keep putting together good combinations of files by hand when burning CDs and DVDs. :)

Thanks for the help. :)

0

If you are looking for a reasonable heuristic, and the goal is to minimize the number of disks required, here is a simple one you might consider. It is similar to one I recently used for a shop scheduling problem, where I was able to compare it with known optima and found that it gave allocations that were optimal or very close to optimal.

Suppose B is the combined size of all the files and C is the capacity of each disk. You will then need at least n = ceil(B / C) disks (B / C rounded up). Try to fit all the files on n disks. If you can, you are done, and the solution is optimal. Otherwise, try to fit all the files on n + 1 disks. If you succeed, you have a heuristic solution; otherwise try n + 2 disks, and so on, until you succeed.

For any given assignment of files to disks in the procedure below (which may exceed some disk capacities), let s_i be the combined size of the files assigned to disk i, and let t = max s_i. We are finished when t <= C.

First order (and index) the files from largest to smallest.

For m >= n disks:

  • Assign the files to the disks in back-and-forth order: 1 -> 1, 2 -> 2, ..., m -> m, m+1 -> m-1, m+2 -> m-2, ..., 2m -> 1, 2m+1 -> 2, 2m+2 -> 3, ..., 3m -> m, 3m+1 -> m-1, and so on, until all files are assigned, without regard to disk capacity. If t <= C we are finished (and the assignment is optimal if m = n); otherwise go to step 2.

  • Try to decrease t by moving a file from a disk i with s_i = t to another disk, without increasing t. Keep doing this until either t <= C, in which case we are finished (and the assignment is optimal if m = n), or t cannot be reduced any further, in which case go to step 3.

  • Try to decrease t by making pairwise exchanges of files between disks. Keep doing this until either t <= C, in which case we are finished (and the assignment is optimal if m = n), or t cannot be reduced any further by pairwise exchanges. In the latter case, repeat step 2, unless no improvement was made the last time through step 2, in which case increment m by one and go back to step 1.

In steps 2 and 3 there are, of course, various ways to order the candidate moves and pairwise exchanges.
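To show the overall shape, here is a rough, hypothetical sketch of steps 1 and 2 (the back-and-forth assignment and the single-file moves). The names, the example data and the exact move order are my own assumptions, step 3 (pairwise exchanges) is omitted for brevity, and it assumes every individual file fits on one disk.

    # Hypothetical sketch of the heuristic above: back-and-forth assignment
    # followed by single-file moves off the fullest disk. Sizes in MB.
    def serpentine_assign(sizes, m):
        """Step 1: deal files, largest first, across m disks in a zigzag."""
        disks = [[] for _ in range(m)]
        order = list(range(m)) + list(range(m - 1, -1, -1))   # 0..m-1, m-1..0
        for k, size in enumerate(sorted(sizes, reverse=True)):
            disks[order[k % len(order)]].append(size)
        return disks

    def improve_by_moves(disks, C):
        """Step 2: move files off the fullest disk while that lowers t."""
        while True:
            loads = [sum(d) for d in disks]
            t = max(loads)
            if t <= C:
                return disks, t                      # feasible: done
            i = loads.index(t)                       # a fullest disk
            moved = False
            for size in sorted(disks[i]):
                for j, load in enumerate(loads):
                    if j != i and load + size < t:   # move must not raise t
                        disks[i].remove(size)
                        disks[j].append(size)
                        moved = True
                        break
                if moved:
                    break
            if not moved:
                return disks, t                      # stuck: caller tries m+1

    def pack(sizes, C):
        # Assumes every file is no larger than C (otherwise it can never fit).
        m = -(-sum(sizes) // C)                      # ceil(B / C): minimum disks
        while True:
            disks, t = improve_by_moves(serpentine_assign(sizes, m), C)
            if t <= C:
                return disks
            m += 1                                   # could not fit: add a disk

    if __name__ == '__main__':
        files = [4000, 3500, 3000, 2500, 2000, 1500, 1200, 800]   # made-up sizes
        for d in pack(files, C=4700):                             # ~4.7 GB per DVD
            print(d)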

0
