Processing only non-empty lines

Question

I have the following code snippet:

    def send(self, queue, fd):
        for line in fd:
            data = line.strip()
            if data:
                queue.write(json.loads(data))

This works just fine, of course, but I sometimes wonder whether there is a "better" way to write this construct, where you only act on non-blank lines.

The challenge is that it should keep the iterative nature of reading from "fd" and be able to handle files in the 100+ MB range.

UPDATE - In your rush to get points for this question, you are ignoring the important part, which is memory usage. For example, the expression:

  non_blank_lines = (line.strip() for line in fd if line.strip())

This buffers the entire file in memory, not to mention calling strip() twice. That works for small files, but it fails when you have 100+ MB of data (or, once in a while, 100 GB).

Part of the challenge is that the following works, but it is hard to read:

    for line in ifilter(lambda l: l, imap(lambda l: l.strip(), fd)):
        queue.write(json.loads(line))
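For readers on Python 3, where the built-in map and filter are already lazy, the same pipeline can be sketched without itertools. This is a minimal sketch, not the asker's code: `FakeQueue` and the sample input are made up for illustration, and it assumes one JSON document per line.

```python
import io
import json


class FakeQueue:
    """Hypothetical stand-in for the real queue; it just collects items."""

    def __init__(self):
        self.items = []

    def write(self, item):
        self.items.append(item)


def send(queue, fd):
    # map() strips each line lazily; filter(None, ...) drops the empty strings.
    for line in filter(None, map(str.strip, fd)):
        queue.write(json.loads(line))


# Sample input with an empty line and a whitespace-only line.
fd = io.StringIO('{"a": 1}\n\n   \n{"b": 2}\n')
queue = FakeQueue()
send(queue, fd)
print(queue.items)
```

Only one line is held at a time, because neither map nor filter builds an intermediate list in Python 3.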

Looking for magic, folks!

FINAL UPDATE: PEP 289 was very useful for better understanding the difference between [] and () when iterators are involved.
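The difference between [] and () can be made visible with a small experiment. The sketch below is illustrative only: `logged_lines` is a hypothetical helper that records every line it hands out, so the log shows how much of the source has been consumed at each point.

```python
def logged_lines(lines, log):
    """Yield lines one at a time, recording each read in `log`."""
    for line in lines:
        log.append(line)
        yield line


source = ["first\n", "\n", "second\n", "third\n"]

# List comprehension: every line is read and stored before it is used.
eager_log = []
eager = [line.strip() for line in logged_lines(source, eager_log)]
eager_reads_before_use = len(eager_log)

# Generator expression: nothing is read until a value is requested.
lazy_log = []
lazy = (line.strip() for line in logged_lines(source, lazy_log))
lazy_reads_before_use = len(lazy_log)

first = next(lazy)
lazy_reads_after_one = len(lazy_log)

print(eager_reads_before_use, lazy_reads_before_use, lazy_reads_after_one)
```

The list comprehension consumes the whole source up front, while the generator expression reads one line per requested value, which is what keeps memory flat on large files.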

+6

python

koblas Dec 03 '12 at 17:26

2 answers

There simply isn't a "better" way than yours: it works as intended, it is easy to read, and so on. However, if you count speed as "better", some small adjustments can certainly be made.

I don't know much about performance in Python, but here are a few suggestions that only work under certain conditions. I hope someone else comes up with something better; maybe this answer will help them.

If the file will not contain whitespace-only lines such as ' \n', but only bare '\n' lines, then this approach is noticeably faster:

    def send(self, queue, fd):
        for line in fd:
            if line != '\n':
                queue.write(json.loads(line.strip()))

Timing values:

    using: strip()                         :: 1.8722578811916337
    using: line != '\n'                    :: 1.0126976271093881
    using: line != '\n' and line != ' \n'  :: 1.2862439244170275

Note, however, that this may turn out even slower if the file contains no bare \n lines at all. I ran the timings with fd set to ["string", "\n", "test string", "\n", "moreeee", "\n", "An other element"].

You probably won't know in advance whether the blank lines are only \n. Still, .strip() is fairly slow, so there may be more efficient ways.
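A comparison along these lines can be reproduced with timeit. This is a rough sketch rather than the exact script used for the numbers above: the function names are made up, the results are collected into a list instead of being passed to queue.write(json.loads(...)) so that only the filtering is measured, and the absolute figures will differ between machines and Python versions.

```python
import timeit

# Same sample data as the timings above.
fd = ["string", "\n", "test string", "\n", "moreeee", "\n", "An other element"]


def with_strip(lines):
    out = []
    for line in lines:
        data = line.strip()
        if data:
            out.append(data)
    return out


def with_newline_check(lines):
    # Only correct when every blank line is exactly "\n".
    out = []
    for line in lines:
        if line != "\n":
            out.append(line.strip())
    return out


def measure(number=100_000):
    return {
        "strip()": timeit.timeit(lambda: with_strip(fd), number=number),
        "line != '\\n'": timeit.timeit(lambda: with_newline_check(fd), number=number),
    }


if __name__ == "__main__":
    for label, seconds in measure().items():
        print(f"using: {label} :: {seconds}")
```

Note the trade-off: with_newline_check lets a whitespace-only line such as " \n" through as an empty string, which json.loads would then reject.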

+1

user1632861 Dec 03 '12 at 18:19

cmh · Accepted Answer · 2012-12-03T17:35:18+0000

There is nothing wrong with the code as written; it is readable and efficient.

An alternative approach would be to write it as a generator expression:

    def send(self, queue, fd):
        non_blank_lines = (line.strip() for line in fd if line.strip())
        for line in non_blank_lines:
            queue.write(json.loads(line))  # was json.loads(data), an undefined name

This approach can be useful (terser) if you are using a function that accepts an iterator, for example Python 3's print:

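    non_blank_lines = (line.strip() for line in fd if line.strip())
    with open('foo', 'w') as out:  # file= needs a writable file object, not a name
        print(*non_blank_lines, file=out)  # note: * reads every line into memory first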

To eliminate the multiple strip() calls, chain generator expressions together:

    stripped_lines = (line.strip() for line in fd)
    non_blank_lines = (line for line in stripped_lines if line)
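The same idea can also be packaged as a generator function, which is a different but equivalent way to call strip() only once per line. The sketch below is an illustration under assumptions: `non_blank_lines` is a made-up helper name, io.StringIO stands in for the real file, and the results go into a plain list only so they can be inspected.

```python
import io
import json


def non_blank_lines(fd):
    """Yield each line of fd stripped, skipping lines that are empty after stripping."""
    for line in fd:
        data = line.strip()  # strip() runs once per line
        if data:
            yield data


fd = io.StringIO('{"id": 1}\n\n  \n{"id": 2}\n')

records = []
for line in non_blank_lines(fd):
    # In the real code this would be queue.write(json.loads(line)).
    records.append(json.loads(line))

print(records)
```

Like the chained generator expressions, this reads the file one line at a time, so the memory footprint does not grow with the file size.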

Note that generator expressions will not adversely affect memory, as described in PEP 289.

For a deeper understanding of this approach and some performance notes, see this set of notes.

Finally, note that rstrip() will outperform strip() if you do not need the full behavior of strip().
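To see what giving up the full behavior of strip() means in this case, here is a small sketch with a made-up sample line. It relies on json.loads accepting surrounding whitespace, so leading spaces left behind by rstrip() do not affect parsing.

```python
import json

line = '   {"id": 1}   \n'

stripped = line.strip()    # whitespace removed from both ends
rstripped = line.rstrip()  # only trailing whitespace removed

# json.loads ignores surrounding whitespace, so both parse to the same value.
same_result = json.loads(stripped) == json.loads(rstripped)

# A whitespace-only line is empty after either call, so the blank-line test still works.
blank = " \t\n"
blank_is_skipped = not blank.rstrip()

print(repr(stripped), repr(rstripped), same_result, blank_is_skipped)
```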
