Why does filter(_:)'s predicate get called so many times when evaluating it lazily?

I saw the answer to this question, which in its first revision had code similar to this:

    let numbers = Array(0 ..< 50)

    let result = numbers.lazy
        .filter { // gets called 2-3x per element in the range 0...15!
            print("Calling filter for: \($0)")
            return $0 % 3 == 0
        }
        .prefix(5)

    print(Array(result)) // [0, 3, 6, 9, 12]

which, through the use of a lazy filter collection, gets the first 5 elements of numbers that satisfy a given predicate (in this case, being divisible by 3), without having to evaluate every element in numbers.

However, the answer then noted that the predicate passed to filter(_:) can be called multiple times per element (3 times for elements in the range 1...15, and twice for 0, as it turns out).

What is the reason for this inefficiency in the lazy evaluation of this filter? Is there a way to avoid evaluating the same element multiple times?

1 answer

Problems

The first culprit here is the slicing of the lazy filter collection via prefix(_:), which in this case returns a BidirectionalSlice of a LazyFilterBidirectionalCollection.

In general, slicing a Collection entails storing the base collection along with the range of indices that the slice is allowed to view. Therefore, in order to create a slice of the LazyFilterBidirectionalCollection that views the first 5 elements, the range of stored indices must be startIndex ..< indexAfterTheFifthElement.

To get indexAfterTheFifthElement, the LazyFilterBidirectionalCollection has to iterate over the base collection (numbers) until it finds the 6th element that satisfies the predicate (you can see the exact implementation of the indexing here).

Therefore, all elements in the range 0...15 in the above example have to be checked against the predicate just to create the slice of the lazy filter collection.
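
To see this first cost in isolation, here is a minimal sketch (assuming the Swift 3-era behaviour described above): merely constructing the prefix slice, without ever consuming it, already runs the predicate for every element up to and including the 6th match.

    let numbers = Array(0 ..< 50)

    // Just creating the slice forces the lazy filter collection to walk its
    // indices up to the element after the 5th match (i.e. through 0...15),
    // invoking the predicate once for each element it passes over.
    let slice = numbers.lazy
        .filter { element -> Bool in
            print("Predicate evaluated for: \(element)")
            return element % 3 == 0
        }
        .prefix(5)

    // No element of the slice has been consumed yet, but the predicate has
    // already run for the elements 0...15.
    _ = slice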

The second culprit is Array's init(_:), which accepts a Sequence whose elements are of the same type as the array's Element type. The implementation of this initialiser calls _copyToContiguousArray() on the sequence, which for most sequences forwards the call to this function:

    internal func _copySequenceToContiguousArray<S : Sequence>(
      _ source: S
    ) -> ContiguousArray<S.Iterator.Element> {

      let initialCapacity = source.underestimatedCount // <- problem here

      var builder =
        _UnsafePartiallyInitializedContiguousArrayBuffer<S.Iterator.Element>(
          initialCapacity: initialCapacity)

      var iterator = source.makeIterator()

      // FIXME(performance): use _copyContents(initializing:).

      // Add elements up to the initial capacity without checking for regrowth.
      for _ in 0..<initialCapacity {
        builder.addWithExistingCapacity(iterator.next()!)
      }

      // Add remaining elements, if any.
      while let element = iterator.next() {
        builder.add(element)
      }

      return builder.finish()
    }

The problem here is underestimatedCount. For plain sequences, the default implementation simply returns 0; for collections, however, the default implementation gets the count of the collection (I go into this in more detail here).
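
As a quick illustration of the difference between the two defaults, consider a minimal Sequence that is not a Collection (the FiveNumbers type below is just a hypothetical example for demonstration):

    // A plain Sequence with no Collection conformance.
    struct FiveNumbers: Sequence {
        func makeIterator() -> AnyIterator<Int> {
            var current = 0
            return AnyIterator {
                guard current < 5 else { return nil }
                defer { current += 1 }
                return current
            }
        }
    }

    // A Collection's default underestimatedCount falls back to its count.
    print([0, 1, 2, 3, 4].underestimatedCount) // 5

    // A plain Sequence's default underestimatedCount just returns 0.
    print(FiveNumbers().underestimatedCount)   // 0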

The default implementation of count for a Collection (which BidirectionalSlice will use here) is:

    public var count: IndexDistance {
      return distance(from: startIndex, to: endIndex)
    }

which, for our slice, will walk the indices up to indexAfterTheFifthElement, thereby re-evaluating the elements in the range 0...15 once again.

Finally, an iterator over the slice is made and iterated, adding each element of the sequence to the new array buffer. For a BidirectionalSlice this uses an IndexingIterator, which works simply by advancing indices and subscripting an element for each one.

The reason this walk over the indices does not re-evaluate the elements before the first element of the result (note that in the question's example, 0 is evaluated one time fewer than the rest) is that it does not directly access the startIndex of the LazyFilterBidirectionalCollection, which would have to evaluate all elements up to the first match in order to be computed. Instead, the iterator can work from the stored start index of the slice itself.
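
For reference, here is a simplified model of how such an indexing iterator behaves (an illustrative sketch, not the actual standard library source): it stores the collection it iterates plus a current position, and that position starts at the slice's own startIndex rather than at the base filter collection's startIndex.

    // A simplified model of an indexing iterator: advance an index and
    // subscript the collection at each step, starting from the collection's
    // (here: the slice's) startIndex.
    struct SimpleIndexingIterator<Elements: Collection>: IteratorProtocol {
        let elements: Elements
        var position: Elements.Index

        init(_ elements: Elements) {
            self.elements = elements
            self.position = elements.startIndex
        }

        mutating func next() -> Elements.Iterator.Element? {
            guard position != elements.endIndex else { return nil }
            defer { elements.formIndex(after: &position) }
            return elements[position]
        }
    }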


Solution

A simple solution is to avoid slicing the lazy filter collection in order to get its prefix, and instead to apply the prefix lazily.

There are actually two implementations of prefix(_:). One is provided by Collection and returns a SubSequence (which is some form of slice for most standard library collections).

The other is provided by Sequence and returns an AnySequence, which under the hood uses the base sequence _PrefixSequence. This simply takes an iterator and allows iteration through it until the given number of elements have been returned, after which it just stops vending elements.

For lazy collections, this implementation of prefix(_:) is ideal, since it does not require any indexing; it just applies the prefix lazily.
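
Conceptually, that lazily evaluated prefix behaves something like the following simplified model (an illustrative sketch, not the actual _PrefixSequence source): it wraps the base iterator and simply stops vending elements once the limit has been reached, so the base collection is never indexed.

    // A simplified model of a lazily applied prefix: wrap the base iterator
    // and stop after `remaining` elements have been vended. No indexing of
    // the underlying collection ever takes place.
    struct SimplePrefixIterator<Base: IteratorProtocol>: IteratorProtocol {
        var base: Base
        var remaining: Int

        mutating func next() -> Base.Element? {
            guard remaining > 0 else { return nil }
            remaining -= 1
            return base.next()
        }
    }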

Therefore, if you say:

    let result : AnySequence = numbers.lazy
        .filter { // gets called 1x per element :)
            print("Calling filter for: \($0)")
            return $0 % 3 == 0
        }
        .prefix(5)

The elements of numbers (up to and including the 5th match) will be evaluated by the filter(_:) predicate only once each, upon being passed to the Array initialiser, since you are forcing Swift to use Sequence's default implementation of prefix(_:).
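
Putting that together, a full version might look like this (a sketch; the counts in the comments assume the behaviour described above):

    let numbers = Array(0 ..< 50)

    let result: AnySequence = numbers.lazy
        .filter { element -> Bool in
            print("Calling filter for: \(element)") // now fires once per examined element
            return element % 3 == 0
        }
        .prefix(5)

    // The predicate only runs here, while the AnySequence is copied into the
    // array, and only for the elements up to and including the 5th match (12).
    print(Array(result)) // [0, 3, 6, 9, 12]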

A surefire way to prevent any indexing operations on a given lazy filter collection is to use a lazy filter sequence instead; this can be done by simply wrapping the collection on which you want to perform the lazy operations in an AnySequence:

    let numbers = Array(0 ..< 50)

    let result = AnySequence(numbers).lazy
        .filter { // gets called 1x per element :)
            print("Calling filter for: \($0)")
            return $0 % 3 == 0
        }
        .dropFirst(5) // neither of these will do indexing,
        .prefix(5)    // but instead return a lazily evaluated AnySequence.

    print(Array(result)) // [15, 18, 21, 24, 27]

However, note that for bidirectional collections this can have an adverse effect on operations that work from the end of the collection, since the entire sequence has to be iterated through in order to reach it.

For operations such as suffix(_:) and dropLast(_:), it may well be more efficient to work with a lazy collection rather than a lazy sequence (at least for small inputs), since a bidirectional collection can simply be indexed from its end, as the sketch below illustrates.
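
For example (a sketch; the performance characteristics in the comments assume the behaviour discussed above), suffix(_:) on the lazy filter collection can walk backwards from endIndex, whereas the AnySequence-wrapped version has to run the whole sequence through the predicate to find the last matches:

    let numbers = Array(0 ..< 50)

    // As a lazy *collection*, suffix(2) can walk backwards from endIndex, so
    // essentially only the elements near the end are checked against the predicate.
    let collectionSuffix = numbers.lazy
        .filter { $0 % 3 == 0 }
        .suffix(2)

    // As a lazy *sequence*, there is no endIndex to walk back from, so every
    // element has to be run through the predicate to find the last two matches.
    let sequenceSuffix = AnySequence(numbers).lazy
        .filter { $0 % 3 == 0 }
        .suffix(2)

    print(Array(collectionSuffix)) // [45, 48]
    print(Array(sequenceSuffix))   // [45, 48]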

Although, as with all performance-related concerns, you should first check whether this is actually a problem, and then run your own benchmarks to see which approach works best for your implementation.


Conclusion (TL;DR)

So, after all of this, the lesson to take away is to be wary of the fact that slicing a lazy filter collection can re-evaluate every element of the base collection up to the end index of the slice's view.

It is often preferable to treat the lazy filter collection as a sequence instead, which cannot be indexed, meaning that the lazy operations cannot evaluate any elements (as doing so would risk destructively iterating over them) until an eager operation takes place.

However, you should be aware that by doing so you potentially sacrifice the ability to index the collection from its end, which matters for operations such as suffix(_:).

Finally, it is worth noting that this is not a problem for lazy views such as LazyMapCollection, because their elements do not depend on the results of previous elements, so they can be indexed in constant time if their base collection is a RandomAccessCollection.
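
For instance, in this sketch the transform of a lazy map view over an array is not applied at all while the prefix slice is being formed, and then runs exactly once per element when the result is consumed (the output comments assume the behaviour described above):

    let numbers = Array(0 ..< 50)

    let doubled = numbers.lazy
        .map { element -> Int in
            print("Mapping: \(element)")
            return element * 2
        }
        .prefix(5)

    // Nothing has been printed yet: because the base array is random-access,
    // the index after the 5th element is computed without calling the transform.
    print(Array(doubled)) // prints "Mapping:" once each for 0...4, then [0, 2, 4, 6, 8]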
