Just for fun I tried out the exact dynamic programming solution. It's written in Python because of my firm belief that you shouldn't optimize until you need to ;-)
This could provide either a start, or a rough idea of how close you can get before you have to resort to approximation.
Code is based on http://en.wikipedia.org/wiki/Knapsack_problem#0-1_knapsack_problem , hence the less-than-informative variable names m, W, w, v.
    #!/usr/bin/python
    import sys

    solcount = 0

    class Solution(object):
        def __init__(self, items):
            object.__init__(self)
            #self.items = items
            self.value = sum(items)
            global solcount
            solcount += 1
        def __str__(self):
            #return str(self.items) + ' = ' + str(self.value)
            return ' = ' + str(self.value)

    # memo table: (number of items left, remaining capacity) -> best Solution
    m = {}

    def compute(v, w):
        # best way to fill capacity w using the items in v (a suffix of the full list)
        coord = (len(v), w)
        if coord in m:
            return m[coord]
        if len(v) == 0 or w == 0:
            m[coord] = Solution([])
            return m[coord]
        newvalue = v[0]
        newarray = v[1:]
        notused = compute(newarray, w)
        if newvalue > w:
            m[coord] = notused
            return notused
        # used = Solution(compute(newarray, w - newvalue).items + [newvalue])
        used = Solution([compute(newarray, w - newvalue).value] + [newvalue])
        best = notused if notused.value >= used.value else used
        m[coord] = best
        return best

    def main():
        # file sizes (in blocks), one per line; capacity W comes from the command line
        v = [int(l) for l in open('filesizes.txt')]
        W = int(sys.argv[1])
        print len(v), "items, limit is", W
        print compute(v, W)
        print solcount, "solutions computed"

    if __name__ == '__main__':
        main()
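To try it out, something like this is enough to generate a test input of the kind described below (the 100 files and the 2000-6000 range match my test run, but this little generator itself is just an illustration):

    # generate 100 random file sizes (in blocks) in the range 2000-6000,
    # one per line, as expected by main() above
    import random

    with open('filesizes.txt', 'w') as f:
        for _ in range(100):
            f.write('%d\n' % random.randint(2000, 6000))

Then run the script above with the size limit as its only argument, e.g. 40000.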
For simplicity, I only track the file sizes: once you have the list of sizes you want to use, you can find some file names with those sizes by searching through the list, so there's no point tangling file names up in the core, slow part of the program. I also express everything in multiples of the block size.
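For reference, that conversion is just rounding each file up to a whole number of blocks, something like this (the 2048-byte block size is an assumption):

    # convert a raw byte count to whole blocks, rounding up (assumed 2k blocks)
    BLOCK = 2048

    def to_blocks(size_bytes):
        return (size_bytes + BLOCK - 1) // BLOCK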
As you can see, I've commented out the code that gives the actual solution (as opposed to the value of the solution). That was to save memory: the proper way to store the list of files used is not one list in each Solution, it's to have each solution point back to the Solution it was derived from. You can then compute the list of file sizes at the end by walking back through the chain and outputting the difference between the values at each step.
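A minimal sketch of that back-pointer idea (not wired into the code above, and with a different constructor from the one I actually used):

    # each Solution remembers the Solution it was derived from; the sizes used
    # can then be recovered from the value differences along the parent chain
    class Solution(object):
        def __init__(self, value, parent=None):
            self.value = value      # total size chosen so far
            self.parent = parent    # the Solution this one was built from

    def used_sizes(solution):
        sizes = []
        while solution.parent is not None:
            sizes.append(solution.value - solution.parent.value)
            solution = solution.parent
        return sizes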
With a list of 100 randomly generated file sizes in the range 2000-6000 (I'm assuming 2k blocks, so that's files of 4-12MB), this solves for W = 40K in 100 seconds on my laptop. In doing so it computes 2.6M of a possible 4M solutions.
Complexity is O(W*n), where n is the number of files. This does not contradict the fact that the problem is NP-complete: the running time is pseudo-polynomial, since W itself grows exponentially in the number of bits needed to write it down. So I am at least approaching a solution, and that's just in unoptimized Python.
Clearly some optimization is now needed, since really we want it solved for W = 4M (an 8GB DVD) and however many files you have (say a few thousand). Assuming the program is allowed to take 15 minutes (comparable to the time it takes to burn a DVD), that means we're currently falling short by a factor of roughly 10^3. So we have a problem that's quite hard to solve quickly and accurately on a PC, but not beyond the bounds of technology.
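Back-of-envelope arithmetic for that factor (the figure of 5000 files is just an illustrative guess):

    # measured above: 2.6M solutions computed in 100 seconds
    rate = 2.6e6 / 100
    W = 4e6             # 8GB DVD in 2k blocks
    n = 5000            # "a few thousand" files
    budget = 15 * 60    # seconds allowed
    print((W * n / rate) / budget)   # ~850, i.e. on the order of 10^3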
The main concern is memory use, since once we start hitting swap we'll slow down, and if we run out of virtual address space we're in real trouble, because then we'd have to implement our own on-disk storage of Solutions. My test run peaks at 600MB. If you wrote the code in C on a 32-bit machine, each "solution" has a fixed size of 8 bytes. You could therefore generate a massive 2-D array of them without doing any memory allocation in the loop, but in 2GB of RAM you could only handle W = 4M and n = 67. No good for actual DVDs. It could almost handle 2k-block CDs, though: W = 350K gives n = 766.
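Quick sanity check of those figures, at 8 bytes per table entry:

    GB = 2.0**30
    print(4e6 * 67 * 8 / GB)       # ~2.0 GB: DVD-sized W only fits 67 files
    print(350e3 * 766 * 8 / GB)    # ~2.0 GB: CD-sized W fits 766 files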
Edit: MAK's point about computing iteratively bottom-up, rather than recursively top-down, should massively reduce the memory requirement. First calculate m(1, w) for all 0 <= w <= W. From that array you can calculate m(2, w) for all 0 <= w <= W, then throw away all the m(1, w) values: you won't need them to calculate m(3, w), and so on.
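A minimal sketch of that bottom-up version (names are mine; it returns only the best achievable total, not which files make it up):

    # keep only the previous row of the DP table: memory is O(W), not O(W*n)
    def best_fill(sizes, W):
        prev = [0] * (W + 1)          # m(i-1, w) for all 0 <= w <= W
        for size in sizes:
            cur = [0] * (W + 1)       # m(i, w)
            for w in range(W + 1):
                cur[w] = prev[w]                                 # item i not used
                if size <= w:
                    cur[w] = max(cur[w], prev[w - size] + size)  # item i used
            prev = cur                # the old row can now be thrown away
        return prev[W]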
By the way, I suspect the problem you actually want to solve might be the bin packing problem, rather than just the question of how to get the best possible value onto one DVD: that's where you have a bunch of files and want to write them all out, using as few DVDs as possible. There are situations where the bin packing problem is very easy to solve, but solving this problem is hard. For example, suppose you have 8GB disks and 15GB of small files. It's going to take some searching to find the closest possible match to 8GB, but the bin packing problem would be trivially solved by just putting roughly half the files on each disc - it doesn't matter exactly how you split them, because you're going to waste 1GB of space whatever you do.
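If bin packing really is the underlying problem, the standard quick heuristic is first-fit decreasing; a sketch of it, purely for illustration:

    # first-fit decreasing: sort files largest-first, put each on the first
    # disc that still has room, opening a new disc when none does
    def first_fit_decreasing(sizes, capacity):
        discs = []                       # each entry: [free_space, [sizes]]
        for size in sorted(sizes, reverse=True):
            for disc in discs:
                if size <= disc[0]:
                    disc[0] -= size
                    disc[1].append(size)
                    break
            else:
                discs.append([capacity - size, [size]])
        return [contents for free, contents in discs]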
All that said, there are extremely fast heuristics that give decent results much of the time. The simplest is to go through the list of files (perhaps in decreasing order of size) and include each file if it fits, exclude it otherwise. You only need to fall back on anything slow if the fast approximate solutions aren't "good enough", for your choice of "enough".
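That heuristic is only a few lines (the helper name is mine):

    # largest-first greedy: include each file if it still fits in the remaining space
    def greedy_fill(sizes, W):
        chosen, remaining = [], W
        for size in sorted(sizes, reverse=True):
            if size <= remaining:
                chosen.append(size)
                remaining -= size
        return chosen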