In my own tests, the performance difference was even more pronounced than in your question. It remained clearly visible after increasing the second and third dimensions of the arr data, and also after commenting out one of the two comparison functions (greater_equal or logical_or), which lets us rule out some strange interaction between the two.
By changing the implementations of the two methods as follows, I was able to significantly reduce the observed performance difference (though not eliminate it completely):
import numpy as np

def method1(arr, low, high):
    out = np.empty(arr.shape, dtype=bool)
    # Pre-fill full-size arrays with the scalar bounds instead of passing scalars
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    np.greater_equal(arr, high, out)
    np.logical_or(out, arr < low, out)
    return out

def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = high[k]
        l = low[k]
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out
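For reference, a minimal timing harness along these lines can be used to compare the two versions (the array shape, bounds, and repetition count below are placeholders, not the exact setup from your question):

import timeit
import numpy as np

arr = np.random.rand(100, 100, 100)   # placeholder shape
low, high = 0.25, 0.75                # placeholder bounds

# Time both methods on the same data
print(timeit.timeit(lambda: method1(arr, low, high), number=100))
print(timeit.timeit(lambda: method2(arr, low, high), number=100))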
I believe that when high or low is supplied as a scalar, these numpy functions may first create a full-size array of the same shape as the data, filled with that scalar. When we do this manually outside the functions, once for the full shape in both cases, the performance difference becomes much less noticeable. This suggests that, for some reason (a cache effect, maybe?), creating one such large array filled with a constant is less efficient than creating k smaller arrays filled with the same constant (which is what would happen automatically, slice by slice, in method2 from the original question).
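To make that hypothesis concrete (the internal behavior is only my guess; the snippet below merely shows that a scalar bound and an explicitly pre-filled array of the same shape give identical results):

import numpy as np

arr = np.random.rand(4, 5)
high = 0.75

# Scalar bound vs. an explicitly pre-filled array of the same shape
res_scalar = np.greater_equal(arr, high)
res_filled = np.greater_equal(arr, np.full_like(arr, high))
assert np.array_equal(res_scalar, res_filled)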
Note: besides narrowing the performance gap, this manual pre-filling also significantly degrades the performance of both methods, and it affects the second method more strongly than the first. So while it may give some idea of where the problem lies, it probably does not explain everything.
EDIT
Here is a new version of method2 in which we now manually create the smaller arrays inside the loop on every iteration, which is what I suspect happens inside numpy with the original implementation from the question:
def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=bool)
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        # Create per-slice bound arrays on every iteration
        h = np.full_like(a, high)
        l = np.full_like(a, low)
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out
This version is indeed much faster than the previous one (confirming that creating several smaller arrays inside the loop is more efficient than creating one large array outside it), but it is still slower than the original implementation from the question.
Assuming these numpy functions really do convert scalar bounds into arrays of this kind, the remaining performance difference between this last function and the one in the question may come down to creating the arrays at the Python level (my implementation) vs. having numpy handle the scalars internally (original implementation).
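As a rough way to look at that Python-level overhead in isolation, one can time a single per-slice comparison with a scalar bound against the same comparison where the bound array is created explicitly on every call (the slice size and repetition count are placeholders):

import timeit
import numpy as np

a = np.random.rand(100, 100)   # one "slice" of the data, placeholder size
o = np.empty(a.shape, dtype=bool)
high = 0.75

# Scalar bound: whatever conversion happens is done inside numpy
print(timeit.timeit(lambda: np.greater_equal(a, high, o), number=10_000))

# Bound array created at the Python level on every call
print(timeit.timeit(lambda: np.greater_equal(a, np.full_like(a, high), o), number=10_000))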