Split groups into almost equal stacks

I have a list of documents, and I want to display them grouped by the first letter of their name on a web page, in three columns.

In short, something like this:

A | C | E A | D | F B | D | F B | D | F | D | 

An important difference from the style of Windows Explorer is that I want the letters to stay with each other. There is no tearing group. To account for this, I don't care if a single column contains multiple records too high.

I started by sorting an array of documents by name and dividing them into a nested array. So I know (or can easily find out):

  • How many unique letters are there
  • How many letters are in each group.
  • How many records in total
  • The average of how many values โ€‹โ€‹should be in each column (ideally, but not necessarily)

I don't care what your answers come in. I am looking for an algorithm, not an implementation, so that you can code anything you want (except maybe Fortran). The explanation in HTML can be tough too.

I invite someone to go DJ by tags, because I could not come up with anything important and no, this is not homework, so please do not mark it as such.

+4
source share
7 answers

Perhaps this will help if you look at the problem as follows:

In your example, you have a line like this:

 AA BB C DDDD E FFF 

Positions in space are places where you could start a new column. In any other place you should not store the same letters in one column. Thus, you can mark a position in space as follows:

 AA1BB2C3DDDD4E5FFF 

Now you have 5 positions where you can either split the column or not, since this is a binary solution, use the brute force row 0 and 1 for this:

 12345 00000 -> no break at all, column count = 1, max. lines = 13 ... 01010 -> your example, column count = 3, max. lines = 5 ... 11111 -> breaks everywhere, column count = 6, max. lines = 4 

This is a brute force attempt, but you can easily see that the number 1 affects the number of columns (number of columns = number 1 + 1), and you want to minimize the maximum. lines, it should be possible to somehow figure out without experiencing every combination.

EDIT2: didn't recognize that you need 3 columns, this makes it easier since you know that you only have 3 1, but that is still brute force.

EDIT: Another approach I would prefer:

Write a letter like this:

 ABCDEF 2 2 1 4 1 3 

Now you can join the letters that are next to each other. Always choose the two letters with the smallest invoice amount:

 2 2 1 4 1 3 - lowest = "2 1" 2 3 4 1 3 - lowest = "1 3" 2 3 4 4 - lowest = "2 3" 5 4 4 - stop now, as we have 3 columns now Result: AABBC, DDDD, EFFF 

This may not lead to an optimal solution, but it is a good and easy way to solve your problem, I think.

+5
source

Well, you can expect to always have a few extra rows in each column. I mean, if you have 2 A, 2 B and 33 C, then the third column will be quite high compared to the others.

This is not Rukpeck's problem, because they must be in order.

What can you do:

  • Count the number of items.
  • See where the first third will fall.
  • If this is exactly the place of the letter change, then you are in luck :)
  • If not, then minimize the distance between the spot of separation of the third part and the spot of changing the previous / next letter - i.e. if the letter changes 2 entries before and 10 entries after, then go to the previous one.
  • Finally, take everything else, divide by two and follow the same logic to split as close to the average as possible.
+3
source

There is no general solution to this problem, subject to your limitations, unless entry is also restricted. Consider, for example, a collection with one document starting with the letters A, B, C, E and F and 15 (or millions) of documents starting with D. To group all D in one column, the column length must be at least 15. If you If you use more than two columns, then in the best case, column 1 will have a length of 3, column 2 will have a length of 15 (or a million), and column 3 will have a length of 2. This violates the "within several records" restriction.

You need to decide if the restriction on non-letter columns is important in order to guarantee potential large differences between the column sizes or the entry restrictions so that the problem can be solved with the given restrictions. Personally, I would rethink the interface as a solution to the optimization problem, just to keep the letters together seems redundant.

+3
source

I think you should start by defining a kind of โ€œmeasureโ€ that will tell you which layout is best, for example. take the sum (average_size - actual_size (column)) ^ 2 for all columns. Then, because you always have 3 columns (right?), It should be fast enough to accept all possible divisions and find the one that maximizes your measure.

0
source

First make a pass over the documents to build an array of letters-> counters.

First record (first letter in the array) -> document 0

Then find the entries that should appear in the second and third columns, going through the array, adding counts, but stopping before you pass the threshold for the second and third columns (which is 1/3 and 2/3 of the total).

0
source

This problem lends itself to a recursive solution - perhaps classical dynamic programming, although I did not work out it very accurately.

You have a fixed number of potential split points and a certain number of partitions. You should have something like

 (splits, max_ht, min_ht) = split_list(list, requested_splits, curr_max, curr_min) 

The function should iterate over potential split points and call itself recursively (with one less requested split) in the rest of the list. For instance.

 def split_list(list, requested_splits, curr_max, curr_min): best_splits = [] best_split_len = curr_max-curr_min best_max = curr_max best_min = curr_min if requested_splits == 0: return (best_splits, curr_max, curr_min) for candidate in candidate_split_points: this_max = max(curr_max, len(<list up to split point>) this_min = min(curr_min, len(<list up to split point>) (splits, new_max, new_min) = split_list(<list after split point>, requested_splits-1, this_max, this_min) if (new_max-new_min) < best_split_len: best_split_len = new_max - new_min best_max = new_max best_min = new_min best_splits = [candidate] + splits return (best_splits, best_max, best_min) 
0
source

You can try it here. Since you know your ideal column number (n), run all of your items in the first column to start with.

Repeat the following steps as often as you see fit ... its iterative algorithm, so the results will be faster to complete during the first few iterations, and then your returns will begin to decrease.

Run sequentially through the columns.

Let the number of elements in the current column be numCurrent.

If numCurrent <n, skip this column.

Keep track of items starting with the first letter (groupFirst) and the last letter (groupLast) of the current column.

Calculate the number of elements in the previous column (if any) as numPrev. If abs (n-numCurrent)> abs (n-numPrev + groupFirst), move groupFirst to the previous column.

Recount numCurrent.

As before, if there is a next column, change groupLast to it if abs (n-numCurrent)> abs (n-numNext + groupLast).

Rinse and repeat. The more rinses, the better it should look. There will be a point at which there will no longer be any changes, as well as points at which it can simply keep going. You decide how many iterations.

0
source

All Articles