Split file into equal parts

Question

Split file into equal parts

using either bash or python (2.4.x)

I have a file - about 100 lines in a file, and the file is structured as follows.

aaaaa, 100 aaaab, 75 aaaac, 150 aaaad, 135 aaaae, 144 aaaaf, 12 aaaag, 5 aaaah, 34 aaaai, 11 aaaaj, 43 aaaak, 88 aaaal, 3 baaaa, 25 baaab, 33 baaac, 87 baaad, 111 baaae, 45 baaaf, 99 baaag, 71 baaah, 68 baaai, 168 baaaj, 21 baaak, 11 baaal, 47 caaaa, 59 caaab, 85 caaac, 77 caaad, 33 caaae, 44 caaaf, 16 caaag, 111 caaah, 141 caaai, 87 caaaj, 59 caaak, 89 caaal, 3

and what I want to do is divide it into 12 columns, with each column having approximately the same number of sensors, and the sum of each column is close to the same.

In other words, if I took the above list and divided it like this.

 aaaaa 100 aaaab 75 baaab 33 aaaai 11 baaah 68 baaac 87 aaaak 88 caaaa 59 caaac 77 199 202 197 aaaah 34 baaaf 99 caaad 33 baaad 111 baaal 47 aaaac 150 aaaaj 43 caaae 44 caaaf 16 188 190 199 aaaag 5 aaaaf 12 baaaa 25 aaaad 135 caaai 87 caaag 111 caaaa 59 caaak 89 baaag 71 199 188 207 aaaae 144 baaaj 21 caaaj 59 aaaal 3 baaak 11 caaah 141 baaae 45 baaai 168 caaal 3 192 200 203

it makes 12 columns of equal elements and is almost close to an even value.

I can do it manually, but we will have to do it several times. I'm not even sure where to start, except that it falls into the array by counting the elements in the array and arbitrarily separating them. still stuck on alignment of values.

Any pointers?

+4

python bash

Chasester May 15, '12 at 15:59

source share

4 answers

An interesting problem, I think you are stuck with a rather expensive process to find the best solution. You can calculate the number of elements in the calculation and the average value that they should have. Sort the elements by an integer and take the largest number while the value is still below the average, then repeat this process until you need to add another element, now select the smallest and try to approach the average value as you can (above or below not matter).

If you get stuck at any step except the last (for example, value> average value), go back and select the next largest one.

+1

Rhand May 15, '12 at 16:09

source share

I wrote two very simple implementations. The first uses deque for pop music both on the right and on the left (after sorting the list) to put low values with high ones. The second is the one suggested by @Sean McSomething.

Here is the code (quick'n'dirty - with a few comments, unfortunately):

 import math import itertools import collections def sum_column(data): return sum(zip(*data)[1], 0.0) def split_groups(sensors): sensors.sort(key=lambda item: item[1], reverse=True) per_group = len(sensors) // 12 average = sum_column(sensors) / len(sensors) data = collections.deque(sensors) groups = [[] for i in xrange(12)] cycle = itertools.cycle(groups) try: while True: current = cycle.next() if len(current) == per_group - 1: if sum_column(current) < average: current.append(data.popleft()) else: current.append(data.pop()) continue current.append(data.popleft()) current.append(data.pop()) except IndexError: return groups def split_groups2(sensors): sensors.sort(key=lambda item: item[1], reverse=True) groups = [[] for i in xrange(12)] cycle = itertools.cycle(groups) per_group = int(math.ceil(len(sensors) / 3.)) partitions = [sensors[i:i + per_group] for i in xrange(0, len(sensors) per_group)] medium, low = map(reversed, partitions[1:]) for sensor, value in itertools.chain(partitions[0], medium, low): cycle.next().append((sensor, value)) return groups def format_groups(result): ret = [] for group in result: tmp = [] tmp.append('\n'.join('{0} {1}'.format(k, v) for k, v in group)) tmp.append(' ' * 8 + str(int(sum_column(group)))) ret.append('\n'.join(tmp)) return '\n\n'.join(ret) if __name__ == '__main__': import sys implementation = split_groups if '--second' in sys.argv: sys.argv.remove('--second') implementation = split_groups2 with open(sys.argv[1]) as fobj: sensors = [] for line in fobj: sensor, value = line.strip().split(', ') sensors.append((sensor, int(value))) sys.stdout.write(format_groups(split_groups(sensors))) sys.stdout.write('\n')

In Gist too:
https://gist.github.com/2703965

I left the easy part (formatting) up to you. At the moment, it just prints them vertically (and not by default, as you requested). It should not be too complicated.

This is the best that it can achieve (with both implementations):

 (max(sums) - min(sums)) / 2. = 16.0

Which is far from an example, but this is the beginning. You can run it from the command line with the file name and, possibly, with the --second option (to use the second implementation). I could use the command line parser, but I'm used to argparse, which is not in Python 2.4. So I just went for this awkward hack.

Execution Example:

 $ python2 groupit.py filename.txt baaai 168 caaal 3 aaaaj 43 214 aaaac 150 aaaal 3 caaae 44 197 aaaae 144 aaaag 5 baaae 45 194 caaah 141 baaak 11 baaal 47 199 aaaad 135 aaaai 11 caaaj 59 205 baaad 111 aaaaf 12 caaaa 59 182 caaag 111 caaaf 16 baaah 68 195 aaaaa 100 baaaj 21 baaag 71 192 baaaf 99 baaaa 25 aaaab 75 199 caaak 89 caaad 33 caaac 77 199 aaaak 88 baaab 33 caaab 85 206 baaac 87 aaaah 34 caaai 87 208 $ python2 groupit.py --second filename.txt baaai 168 caaal 3 aaaaj 43 214 aaaac 150 aaaal 3 caaae 44 197 aaaae 144 aaaag 5 baaae 45 194 caaah 141 baaak 11 baaal 47 199 aaaad 135 aaaai 11 caaaj 59 205 baaad 111 aaaaf 12 caaaa 59 182 caaag 111 caaaf 16 baaah 68 195 aaaaa 100 baaaj 21 baaag 71 192 baaaf 99 baaaa 25 aaaab 75 199 caaak 89 caaad 33 caaac 77 199 aaaak 88 baaab 33 caaab 85 206 baaac 87 aaaah 34 caaai 87 208

With an example in the question, two algorithms give the same answer. If you could provide more test cases, I would try to improve them. I tested the script in Python 2.7 since I don't have 2.4 installed.
Sorry for the long answer.

+1

rubik May 15, '12 at 18:34

source share

Here is what I have come up with so far. I think this is pretty close, and without a very high value, the displacement of things is a bit - well, it's close.

Glad to hear any suggestions to make it more pythonic.

file: columnsplit.py

 #!/usr/bin/python import sys, operator # usage # columnsplit.py <filename> <#cols> # columnsplit.py test.csv 12 # #determine number of devices per column def devicelisting(fulllist,percolumn): devicelist=[] fobj=open(fulllist,'r') for line in fobj: (key, val) = line.split(',') devicelist.append((key,int(val))) devicespercol=(len(devicelist)/int(percolumn)) return(devicelist,devicespercol) def devicesplit(fulllist,numcolumns,roundnum): if roundnum == 0: devices=sorted(fulllist, key=lambda device: device[1], reverse=True) devicestemp=devices else: devices=sorted(fulllist, key=lambda device: device[1]) devicestemp=devices deviceslice=[] for idx, val in zip(range(numcolumns), devices): deviceslice.append(val) devicestemp.remove(val) return(deviceslice,devicestemp) def makecolumns(roundnumber,percol): column=[] for i in range(percol): exec('tempslice=deviceslice%s' % i) column.append(tempslice[roundnumber]) return(column) # what this is going to do is generate how many devices will fill each of the intended # number of columns. What is left over will be run again against the lowest value of columns if __name__ == '__main__': tempslice=[] devices,percol=devicelisting(sys.argv[1],sys.argv[2]) # devices is the devices/value as tuples nested in a list # percol is going to be how many devices per column # you can len(devices) to count how many devices we have # prints out the device list in reverse. # print sorted(devices, key=lambda x: x[1], reverse=True) # what we will need to do here is split the device list into number of desired slices. ie if we want 12 columns # and we have 108 devices there should be 9 slices of 12. # this will leave a remaining slice - of less than 9 which will be added to the 12 columns in order of smallest column first devicesleft=devices numcolumns=int(sys.argv[2]) for i in range(percol): sendcol,devicesleft=devicesplit(devicesleft,numcolumns,i) exec('deviceslice%s=sendcol' % i) # and finally create the columns for i in range(0,numcolumns): sendcol=makecolumns(i,percol) exec('column%s=sendcol' % i) # add the left over devices j=numcolumns # sort remaining reverse. devices=sorted(devicesleft, key=lambda device: device[1], reverse=True) for i in range(len(devices)): j-=1 exec('column%s.append(devices[i])' % j) # prints out the resulting columns for i in range(0,numcolumns): exec('tempcol=column%s' % i) print tempcol print sum([pair[1] for pair in tempcol])

In the test file, I ran it.

file: test44a.csv

 SQCIEOEO,1272 HIKTXYZH,281 JZHRZXKX,5793 UBGTOLUX,147 WBVYFNBN,9 VMHTKHBU,32 GILGFWDA,1334 YKUMWOKT,2066 PFSVTUIP,51 GPJRWKMD,673 TYJZUNZS,27 XTFUHPNX,2102 VFSPABFG,65 ROYOZKRS,189 IARDNRVL,587 LBFSQTQL,973 ZJBZKGFB,21301 UEPUOHMW,20 HEAVWVGH,0 XMANFQZE,719 ZADKGIMB,82 NCVBJIYR,27 NYMJUSQR,20646 EQFKHEOH,2050 ERRLAENN,19 HIPRQNIE,12557 MVNHODYT,20 UEDBIRIN,14 JAZJEMXL,28 UMDLALPN,36 GCUUGTNA,0 XRCGIKTR,12 KSBPEYBZ,20657 LELLPAYW,43792 DTRKMFLK,73 WNQEXJWI,41 CYXHXYHI,10 CSUSTTOX,120 NFHZLSJH,23 FAMDKJLM,25 HIUEHBNJ,261 UIBNCQKP,40 WSPHKYOQ,30025 ZBUJKFWR,0 OQWVSKFM,49 SHZUXKKU,21 CZBMYQDX,45 RXGBCCTR,17 SPMLASXS,15 ZWNXGXRI,59 WTVUJZSB,22 WYDZBWQU,19100 MDFMVCFV,6133 ZSSGQJPM,25 CKHMJZOG,85 YRFZOWTB,28 AYNWBSRA,14 LJGBTVOW,13110 GWJPWXWU,16 PCUDYNEY,179 MSVNLMOX,62 WUYPPNMW,2285 KVLGTIBI,11 KWMIKQHW,11 JDKUPYRM,1851 DARXQYDY,68 UUPXIDEP,139 SKQZMTFY,4377 ZEPOWAEA,189 BWXRVAPP,167 VFMDIRTA,561 BKANEGMD,2122 LBRICWID,1775 TGVOGLDC,3650 QQGZHAAJ,81 KAXPHJSS,122 LKAOHISA,32 ONOVZSYQ,41 IEPQEPZP,62 QWEXGXQS,0 IQGPZYQO,15 MEJLXIBG,10 MRWRHWHX,10 TMVAJLSS,57 BYIAXYOJ,173 DYUAGWGT,248 ODLVZSST,21 EOTOZLHA,6476 KPBHOQQR,30 OLSVIYOW,539 CZSCSLVX,17 ZPMYBTZL,11 IATWRKOF,12507 WGBEFQBH,41 PUJIFEFE,382 TSDULCGU,9070 DARUKFAG,209 MBLRRNYH,250 IIQNNWSG,25 OWBZYIUC,1808 ILXTRXZD,2012 ZLVRZUYH,269 CPVPLOWZ,108 KYZJGTMO,635 EJHWGHZG,25 TUXTOWBR,11 LXGXLCWW,2313 AVFHPRWT,915 AEPHMPNF,32 KLZZHAQT,56 XWQJZNFA,611 JKHYCDSC,1455

to run: python columnsplit.py test44a 12 (12 is the number of desired columns).

Example output columns with the value of the first column .:

 1) 45577 [('LELLPAYW', 43792), ('HEAVWVGH', 0), ('XRCGIKTR', 12), ('ODLVZSST', 21), ('VMHTKHBU', 32), ('TMVAJLSS', 57), ('KAXPHJSS', 122), ('ZLVRZUYH', 269), ('SQCIEOEO', 1272)] 2) 31906 [('WSPHKYOQ', 30025), ('GCUUGTNA', 0), ('UEDBIRIN', 14), ('WTVUJZSB', 22), ('LKAOHISA', 32), ('ZWNXGXRI', 59), ('UUPXIDEP', 139), ('HIKTXYZH', 281), ('GILGFWDA', 1334)] 3) 23416 [('ZJBZKGFB', 21301), ('ZBUJKFWR', 0), ('AYNWBSRA', 14), ('NFHZLSJH', 23), ('AEPHMPNF', 32), ('MSVNLMOX', 62), ('UBGTOLUX', 147), ('PUJIFEFE', 382), ('JKHYCDSC', 1455)] 4) 23276 [('KSBPEYBZ', 20657), ('QWEXGXQS', 0), ('SPMLASXS', 15), ('FAMDKJLM', 25), ('UMDLALPN', 36), ('IEPQEPZP', 62), ('BWXRVAPP', 167), ('OLSVIYOW', 539), ('LBRICWID', 1775)] 5) 23342 [('NYMJUSQR', 20646), ('WBVYFNBN', 9), ('IQGPZYQO', 15), ('ZSSGQJPM', 25), ('UIBNCQKP', 40), ('VFSPABFG', 65), ('BYIAXYOJ', 173), ('VFMDIRTA', 561), ('OWBZYIUC', 1808)] 6) 21877 [('WYDZBWQU', 19100), ('CYXHXYHI', 10), ('GWJPWXWU', 16), ('IIQNNWSG', 25), ('WNQEXJWI', 41), ('DARXQYDY', 68), ('PCUDYNEY', 179), ('IARDNRVL', 587), ('JDKUPYRM', 1851)] 7) 16088 [('LJGBTVOW', 13110), ('MEJLXIBG', 10), ('RXGBCCTR', 17), ('EJHWGHZG', 25), ('ONOVZSYQ', 41), ('DTRKMFLK', 73), ('ROYOZKRS', 189), ('XWQJZNFA', 611), ('ILXTRXZD', 2012)] 8) 15607 [('HIPRQNIE', 12557), ('MRWRHWHX', 10), ('CZSCSLVX', 17), ('TYJZUNZS', 27), ('WGBEFQBH', 41), ('QQGZHAAJ', 81), ('ZEPOWAEA', 189), ('KYZJGTMO', 635), ('EQFKHEOH', 2050)] 9) 17952 [('IATWRKOF', 12507), ('KVLGTIBI', 11), ('ERRLAENN', 19), ('NCVBJIYR', 27), ('CZBMYQDX', 45), ('ZADKGIMB', 82), ('DARUKFAG', 209), ('GPJRWKMD', 673), ('YKUMWOKT', 2066), ('LXGXLCWW', 2313)] 10) 15982 [('TSDULCGU', 9070), ('KWMIKQHW', 11), ('UEPUOHMW', 20), ('JAZJEMXL', 28), ('OQWVSKFM', 49), ('CKHMJZOG', 85), ('DYUAGWGT', 248), ('XMANFQZE', 719), ('XTFUHPNX', 2102), ('TGVOGLDC', 3650)] 11) 14358 [('EOTOZLHA', 6476), ('ZPMYBTZL', 11), ('MVNHODYT', 20), ('YRFZOWTB', 28), ('PFSVTUIP', 51), ('CPVPLOWZ', 108), ('MBLRRNYH', 250), ('AVFHPRWT', 915), ('BKANEGMD', 2122), ('SKQZMTFY', 4377)] 12) 15683 [('MDFMVCFV', 6133), ('TUXTOWBR', 11), ('SHZUXKKU', 21), ('KPBHOQQR', 30), ('KLZZHAQT', 56), ('CSUSTTOX', 120), ('HIUEHBNJ', 261), ('LBFSQTQL', 973), ('WUYPPNMW', 2285), ('JZHRZXKX', 5793)]

0

Chasester May 25 '12 at 18:55

source share

Sean mcsomething · Accepted Answer · 2012-05-15T16:43:57+0000

It will not be fun for large entrances if you want an optimal solution. You look at what is right in line with some of the very well-known difficult issues in CS - Knapsack , Bin Packing and the like. Some simpler, less perfect solutions may be fairly close.

This is not accurate, but given your sample dataset, I managed to get the sizes 214, 197, 194, 199, 205, 182, 195, 192, 199, 199, 206, 208 from a very simple method. It may or may not work with real data.

Method:

Sort list by size
Divide the list into 3 parts - High, Medium and Low
Put each item with a high value in a set.
Reverse middle and low listings.
Put them (in reverse order) into existing sets

Decisions can become more complex as they approach optimal separation.

Split file into equal parts

More articles: