If N×M is large (say, 100×100), the cost of iterating over A is essentially amortized away.
Say the array is 1000 × 100 × 100.
The iteration itself is O(1000), but the cumulative cost of the inner function is O(1000 × 100 × 100), i.e. 10,000 times as much work as the loop itself, so the loop overhead is negligible. (My terminology is a bit loose here, but the point stands.)
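If you want to check that ratio on your own machine, here is a quick timing sketch; foo is just a stand-in for whatever per-slice computation you actually run:

import timeit
import numpy

data = numpy.random.rand(1000, 100, 100)

def foo(slab):
    # stand-in for the real per-slice work
    return slab.sum()

# cost of the bare Python loop, doing no work per slice
loop_only = timeit.timeit(lambda: [None for s in data], number=10)

# cost of the loop plus the per-slice computation
loop_plus_foo = timeit.timeit(lambda: [foo(s) for s in data], number=10)

print(loop_only, loop_plus_foo)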
I'm not sure it will help, but you could try the following:
import numpy

# preallocate the output array instead of building a Python list
result = numpy.empty(data.shape[0])
for i in range(len(data)):
    result[i] = foo(data[i])
You would save the memory of building an intermediate list, though the overhead of the explicit loop is slightly higher.
Or you could write a parallel version of the loop and split it across several processes. This can be much faster, depending on how expensive foo is (its per-slice work has to outweigh the cost of shipping the data between processes).
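A minimal sketch of that idea with multiprocessing; foo here is a placeholder, and it must be a picklable top-level function for Pool.map to work:

from multiprocessing import Pool

import numpy

def foo(slab):
    # placeholder for the real, expensive per-slice computation
    return slab.sum()

if __name__ == "__main__":
    data = numpy.random.rand(1000, 100, 100)
    with Pool() as pool:
        # each 100x100 slice is pickled and sent to a worker process
        result = numpy.array(pool.map(foo, data))

If foo is cheap, the pickling and inter-process transfer will dominate and this can end up slower than the plain loop.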