Breaking down numbers from a list using dynamic programming

I've posted here before about the project I'm working on, and I keep running into design problems and having to start over from scratch. So I'd like to post what I'm trying to do, and hopefully someone can help me figure out how to get the result I want.

Background:

I am new to programming and trying to learn, so I picked a project that interests me: basically, take a list of numbers and break down each number using only the numbers from that list. I know I could do this easily the straightforward way (which I did), but I also wanted to learn about HBase, Hadoop, and parallel processing, so I wanted to structure it so that the work could be split across different machines. I figured that for this I would need dynamic programming and recursion to build a lookup table whose entries can be broken down even further.

Example:

If I send the list [1, 2, 4], I should get {1: [[1]], 2: [[1, 1]], 4: [[2, 2]]}, which basically says 2 + 2 = 4, 1 + 1 = 2, and 1 = 1. So when trying to find all the ways to make 4, I can just consult this table (which will be in the database), see 2 + 2 = 4, then look up 2, and so on. I have the lookup working, but the decomposition is giving me problems. Brute force won't let me handle large numbers (say, a million, with a thousand numbers in the list) in a way that lets me use Hadoop or some other tool to scale it. Here are some more examples of possible results:

 [1, 2, 3, 4] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]]}
 [1, 2, 3, 4, 6, 10] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]], 6: [[2, 4], [3, 3]], 10: [[4, 6], [2, 2, 2, 2, 2]]}
 [10, 30, 50] = {50: [[10, 10, 30]], 30: [[10, 10, 10]]}
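A minimal sketch of the table I'm after, assuming each entry records direct pairs from the list that sum to it (longer chains would then come from recursively looking entries up, as described above; the function name is mine):

```python
def pair_table(nums):
    """Build {n: list of pairs [a, b] from nums with a + b == n};
    numbers with no pair decomposition map to themselves."""
    nums = sorted(set(nums))
    table = {}
    for t in nums:
        # a <= b avoids duplicate pairs; a < t keeps t out of its own pairs
        pairs = [[a, b] for i, a in enumerate(nums) for b in nums[i:]
                 if a + b == t and a < t]
        table[t] = pairs if pairs else [[t]]
    return table
```

For [1, 2, 4] this reproduces the first example above; deeper breakdowns (like the [2, 2, 2, 2, 2] for 10) would come from expanding the pairs recursively.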

The logic behind this approach is that computing the next possible entry from the table should take almost no time, so even if I send a list with a million numbers, it can do it quickly and can even be scaled out to a Hadoop cluster.

The code I wrote to make it work is here, but this question is really about how to fix the design problem. I got advice that this is a partition problem, looked around, and found much simpler versions of what I'm trying to do (activestate), but that's not exactly it, because those break down a number without restricting the parts to a specific set of numbers.

Question:

So, hopefully, I have clearly described what I'm trying to do. What do I need to read / study / learn to build a partition table of a list in Python using dynamic programming so that it can scale? This is just a hobby, with no deadline, but I feel like I have been working on it for more than 3 months, and every time I run into design problems that make me start from scratch. How can I build this correctly (to see where I went wrong, see my link above)? I googled and found solutions to the knapsack problem and partition problems, but they seem more suited to schoolwork and not designed to scale to large datasets.

Hope someone can give me some insight; regardless, thank you.

1 answer

You should be aware that DP problems are, by themselves, not well suited to independent, distributed computation.

When you look at the classic forms of DP algorithms, you have a matrix / table / array, and you compute the new values sequentially in a specific order. Each value depends on other values that must have been computed earlier. You therefore lose data independence and can, at most, compute a certain number of cells of the array at the same time, depending on the specific DP algorithm. For example, many DP algorithms can process an entire column of the table in parallel, because each of its cells only uses cells from the previous column. But that is already the limit, due to the data dependence of every cell after that column.
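The column-wise parallelism described above can be illustrated with a small subset-sum table (a hypothetical example, not the asker's exact problem): each column depends only on the previous one, so the cells of one column are independent, but the columns themselves must be computed in order.

```python
from concurrent.futures import ThreadPoolExecutor

def reachable_sums(nums, bound):
    """Subset-sum DP: prev[s] is True if some subset of the items seen
    so far sums to s. Column j depends only on column j-1, so all cells
    of one column can be computed in parallel, but the loop over items
    (columns) is strictly sequential."""
    prev = [False] * (bound + 1)
    prev[0] = True  # the empty sum
    with ThreadPoolExecutor() as pool:
        for n in nums:  # one "column" per item: cannot be parallelized
            def cell(s, prev=prev, n=n):
                # s is reachable with n available iff it was already
                # reachable, or s - n was reachable
                return prev[s] or (s >= n and prev[s - n])
            # all cells of the new column are computed independently
            # (threads only illustrate the idea; CPython's GIL means
            # no real speedup here)
            prev = list(pool.map(cell, range(bound + 1)))
    return prev
```

The independence within a column is real, but note how little it buys you: the outer loop remains a serial chain, which is the limitation described above.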

At the same time, computing all the possible sums of the numbers in your list is NOT a DP problem. You are not solving any subproblems at all; you are simply collecting all possible sums (and keeping those that match one of the entries in your list).

Therefore, I suggest the following straightforward approach:

  • Compute a new list of all possible combinations to sum. This step is data independent and can be parallelized, but you need an upper bound on the combination length so it terminates. Example: [1, 2, 4] becomes [[1], [2], [4], [1, 1], [1, 2], [1, 4], ...]. You do not need to materialize this list explicitly; you can just stream each combination into the next step.
  • Evaluate each candidate, i.e. compute its sum and check whether it matches a value from your source list. Again, the candidates are data independent, and each of these evaluations can be done on its own.
  • Combine the positive results into the final data structure.
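The three steps above can be sketched as follows (max_parts is an assumed cutoff standing in for the upper bound mentioned in the first step):

```python
from itertools import combinations_with_replacement

def matching_sums(nums, max_parts=3):
    """Sketch of the three-step approach: 1) generate candidate
    combinations, 2) evaluate each sum independently, 3) combine
    the matches into the final table."""
    targets = set(nums)

    def evaluate(combo):  # step 2: fully independent per candidate
        s = sum(combo)
        return (s, list(combo)) if s in targets else None

    # Step 1: stream the combinations instead of materializing them.
    combos = (c for r in range(2, max_parts + 1)
              for c in combinations_with_replacement(sorted(targets), r))

    # The map() below is the parallelizable part: it could be replaced
    # by a multiprocessing Pool.map or a Hadoop map phase.
    result = {}
    for hit in map(evaluate, combos):
        if hit is not None:  # step 3: combine positive results
            result.setdefault(hit[0], []).append(hit[1])
    return result
```

For [1, 2, 4] with max_parts=3 this yields {2: [[1, 1]], 4: [[2, 2], [1, 1, 2]]}; because every candidate is evaluated independently, the evaluation step shards cleanly across machines.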

So, to summarize and answer your questions:

  • Reconsider whether you want to treat this as a DP problem at all.
  • You might want to read up on data parallelism. It comes up especially when solving such problems on GPUs, so the relevant CUDA / OpenCL literature can also be useful.
