Find spaces in a string sequence

Question

Find spaces in a string sequence

I have a sequence of lines - 0000001, 0000002, 0000003.... up to 2 million. They are not adjacent. This means that there are gaps. Let's say after 0000003 the next line could be 0000006. I need to find out all these spaces. In the above case (0000004, 0000005).

This is what I have done so far -

 gaps = list() total = len(curr_ids) for i in range(total): tmp_id = '%s' %(str(i).zfill(7)) if tmp_id in curr_ids: continue else: gaps.append(tmp_id) return gaps

But, you guessed it, this is slow since I am using list . If I use dict to pre-populate curr_ids, it will be faster. But what is the difficulty of populating a hash table? What is the fastest way to do this.

+6

python

Srikar appalaraju Nov 26 '10 at 11:09

source share

4 answers

You can use bitarray to save a sequence of 2 million ints . Here, each bit means one integer (the integer of this index in bitarray). Code example:

 gaps = [] # bitarray is 0 based a = bitarray.bitarray(total + 1) a.setall(False) for sid in curr_ids: a[int(sid)] = True for i in range(1, total): if not a[i]: gaps.append('%07d' %(i)) return gaps

+3

Michał Niklas Nov 26 '10 at 11:33

source share

 seq = *the sequence of strings* n = 2000000 gaps = set(str(i).zfill(7) for i in range(1,n+1)) - set(seq)

+1

jairajs89 Nov 26 '10 at 11:23

source share

I would suggest taking an int instead of a string to process, and then adding the string to the output again

 j=0 n=2000000 #create a list of int number from your string foo = [i for i in range(n)] #creating gaps foo.remove(1) foo.remove(50) while j<n: for i in foo: if i>j: print '%07d'%j j+=1 j+=1

0

Rafi Nov 26 '10 at 15:49

source share

Gareth rees · Accepted Answer · 2010-11-26T11:28:55+0000

You can sort the list of identifiers and then execute it only once:

 def find_gaps(ids): """Generate the gaps in the list of ids.""" j = 1 for id_i in sorted(ids): while True: id_j = '%07d' % j j += 1 if id_j >= id_i: break yield id_j >>> list(find_gaps(["0000001", "0000003", "0000006"])) ['0000002', '0000004', '0000005']

If the input list is already in order, you can avoid sorted (although it does little harm: Python adaptive mergesort O (n) if the list is already sorted).

Find spaces in a string sequence

More articles: