Getting the size of an iterator?

I am searching a text file for a specific line using re.finditer(pattern, text).

I would like to know when this returns nothing, meaning that it could not find anything in the text passed to it.

I know that callable iterators have next() and __iter__ methods.

I would like to know whether I can get its size, or otherwise find out whether it returned any string matching my pattern.

+6
6 answers

EDIT 3: The answer from @hynekcer is much better than this.

EDIT 2: This will not work if you have an infinite iterator, or one that consumes too many gigabytes of RAM or disk space (in 2010, 1 gigabyte is still a large amount of either).

You have already seen a good answer, but here is an expensive hack that you can use if you want to have your cake and eat it too :) The trick is to clone the cake and, when you are done eating, put it back in the same box. Remember that when you iterate over an iterator, it usually becomes empty, or at least loses the previously returned values.

    >>> def getIterLength(iterator):
    ...     temp = list(iterator)
    ...     result = len(temp)
    ...     iterator = iter(temp)
    ...     return result
    ...
    >>> f = xrange(20)
    >>> f
    xrange(20)
    >>> x = getIterLength(f)
    >>> x
    20
    >>> f
    xrange(20)

EDIT: Here is a safer version, but using it still requires some discipline. It does not feel very Pythonic. You would get a better solution if you posted the relevant code sample for what you are trying to implement.

    >>> def getIterLenAndIter(iterator):
    ...     temp = list(iterator)
    ...     return len(temp), iter(temp)
    ...
    >>> f = iter([1,2,3,7,8,9])
    >>> f
    <listiterator object at 0x02782890>
    >>> l, f = getIterLenAndIter(f)
    >>> l
    6
    >>> f
    <listiterator object at 0x02782610>
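
Why the safer version hands the iterator back can be sketched in Python 3 (the sessions above are Python 2). Rebinding the parameter inside a function has no effect on the caller's variable, so the caller has to take the fresh iterator from the return value. The snake_case function name and the sample text here are illustrative assumptions, not part of the original answer.

```python
import re

def get_iter_len_and_iter(iterator):
    """Return (length, fresh_iterator); the input iterator is consumed."""
    items = list(iterator)          # materializes every element in memory
    return len(items), iter(items)

# Hypothetical sample data, not from the original question.
matches = re.finditer(r"\d+", "a1 b22 c333")

# Rebind the name to the fresh iterator returned by the helper.
count, matches = get_iter_len_and_iter(matches)
found = [m.group() for m in matches]  # the matches are still available

print(count, found)
```

The discipline mentioned above is the rebinding on the call line: forgetting it leaves the caller holding the exhausted original iterator.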
+7

This solution uses less memory because, unlike the other solutions that use list, it does not store intermediate results:

 sum(1 for _ in re.finditer(pattern, text)) 

All the older solutions have the disadvantage of consuming a lot of memory if the pattern is very frequent in the text, for example the pattern '[a-z]'.

Test case:

    pattern = 'a'
    text = 10240000 * 'a'

The solution with sum(1 for ...) uses approximately only the memory needed for the text itself, i.e. len(text) bytes. The previous solutions with list can use about 58 or 110 times more memory than necessary: 580 MB on 32-bit and 1.1 GB on 64-bit Python 2.7, respectively.
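
A self-contained Python 3 sketch of the same comparison is shown here. The text is scaled down from the test case so that it runs quickly, and the script only checks that both approaches give the same count; it does not measure memory.

```python
import re

pattern = "a"
text = 10240 * "a"   # scaled down from the test case for a quick run

# The generator yields one match at a time, so no list of match objects is built.
count = sum(1 for _ in re.finditer(pattern, text))

# Same result, but this holds every match object in memory at once.
count_via_list = len(list(re.finditer(pattern, text)))

print(count, count_via_list)
```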

+13

No, sorry: iterators are not meant to know their length. They only know what comes next, which makes them very efficient at going through collections. Although they are faster, they do not allow indexing, which includes knowing the length of the collection.
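
A short Python 3 sketch makes this concrete: calling len() on the iterator returned by re.finditer raises TypeError, while next() works as expected. The sample pattern and text are assumptions chosen for illustration.

```python
import re

matches = re.finditer("a", "banana")  # hypothetical sample input

try:
    len(matches)
    has_len = True
except TypeError:
    # The iterator returned by re.finditer does not implement __len__.
    has_len = False

first = next(matches)   # iterators only support "give me the next item"
print(has_len, first.group(), first.start())
```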

+5

You can get the number of elements in the iterator by doing:

 len( [m for m in re.finditer(pattern, text) ] ) 

Iterators are iterators because they have not generated the sequence yet. The code above basically pulls each item out of the iterator into a list until the iterator is exhausted, and then takes the length of that list. Something more memory efficient would be:

    count = 0
    for item in re.finditer(pattern, text):
        count += 1

A trickier alternative to the for loop is to use reduce to count the elements in the iterator one at a time. It does effectively the same thing as the for loop:

 reduce( (lambda x, y : x + 1), myiterator, 0) 

This basically ignores the y passed to the lambda by reduce and just adds one. It initializes the running sum to 0.
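
In Python 3, reduce is no longer a builtin and has to be imported from functools. A minimal runnable sketch, with a hypothetical sample input standing in for myiterator:

```python
from functools import reduce  # in Python 3, reduce lives in functools
import re

my_iterator = re.finditer("a", "banana")  # hypothetical sample input

# x is the running count; y is the current match object and is ignored.
count = reduce(lambda x, y: x + 1, my_iterator, 0)

print(count)
```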

+1

Although some iterators may be able to know their length (for example, because they were created from a string or a list), most of them do not and cannot. re.finditer is a good example of one that cannot know its length until it is finished.

However, there are several ways to improve your current code:

  • use re.search to find if there are matches, then use re.finditer to do the actual processing; or

  • use a sentinel value with the for loop.

The second option looks something like this:

    match = empty = object()
    for match in re.finditer(...):
        # do some stuff
    if match is empty:
        # there were no matches
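
Both options can be sketched as small runnable Python 3 functions. The function names, return values and sample inputs are assumptions made for illustration; only re.search, re.finditer and the sentinel idiom come from the answer itself.

```python
import re

def process_matches(pattern, text):
    """Option 1: return the matched strings, or None when nothing matches."""
    # A cheap existence check before the real processing.
    if re.search(pattern, text) is None:
        return None
    return [m.group() for m in re.finditer(pattern, text)]

def last_match_or_none(pattern, text):
    """Option 2: a sentinel detects that the loop body never ran."""
    match = empty = object()
    for match in re.finditer(pattern, text):
        pass  # do some stuff with each match here
    # If the loop never assigned to match, it still refers to the sentinel.
    return None if match is empty else match.group()

print(process_matches(r"\d+", "a1 b22"))
print(last_match_or_none(r"\d+", "abc"))
```

Option 1 scans the start of the text twice, once in re.search and once in re.finditer. Option 2 makes a single pass, at the cost of the less obvious sentinel check.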
+1

A quick solution is to turn your iterator into a list and check the length of this list, but it can be bad for memory if there are too many results.

    matches = list(re.finditer(pattern, text))
    if matches:
        do_something()
    print("Found", len(matches), "matches")
0