You are asking the wrong question
Looking at the validate_email package, your real problem is that you are not batching your work efficiently. You should do the MX lookup only once per domain, then connect to each MX server only once, go through the handshake, and check all the addresses for that server in one batch. Fortunately, the validate_email package caches MX results for you, but you still need to group the email addresses by server so that the requests to the server itself can be batched.
You would need to modify the validate_email package to implement batch processing, and then possibly dedicate a thread to each domain, using the actual threading library, not multiprocessing.
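As a rough sketch of that grouping step (batch_verify here is a hypothetical helper, standing in for whatever batched MX-lookup/SMTP check you would add to validate_email):

from collections import defaultdict

# Group addresses by domain so each MX server only has to be contacted once.
by_domain = defaultdict(list)
with open(email_path) as f:
    for line in f:
        email = line.strip()
        if email:
            by_domain[email.split('@', 1)[-1].lower()].append(email)

for domain, addresses in by_domain.items():
    # batch_verify is hypothetical: one MX lookup and one SMTP handshake,
    # then a RCPT TO check for every address on that server.
    results = batch_verify(domain, addresses)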
It is always important to profile your program when it is slow and find out where it actually spends its time, rather than blindly applying optimization tricks.
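For example, the standard-library profiler will show you where the time goes (process_emails is a placeholder name for whatever function drives your validation loop):

import cProfile
import pstats

# Run the workload under the profiler and show the 10 most expensive calls.
cProfile.run('process_emails()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)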
Requested solution
IO is effectively already asynchronous if you use buffered I/O and your usage pattern fits the OS buffering. The only place you could gain anything is read-ahead, but Python already does that when you use iterator access to the file (which you do). AsyncIO is a win for programs that move large amounts of data and disable OS buffering to avoid copying the data twice.
You need to actually profile/measure your program to see whether it has room for improvement. If your disks are not already saturated, there is a chance to improve performance by processing the email addresses in parallel. The easiest way to check this is probably to see whether the core running your program is maxed out (i.e. whether you are CPU-bound rather than IO-bound).
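One quick way to estimate that from inside the program (assuming Python 3, and the same hypothetical process_emails entry point as above) is to compare CPU time against wall-clock time; a ratio close to 1.0 means you are CPU-bound, a small ratio means you are mostly waiting on IO:

import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

process_emails()  # hypothetical driver for the validation loop

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print('CPU-bound fraction: %.2f' % (cpu_elapsed / wall_elapsed))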
If you are CPU-bound, you need to look at threading. Unfortunately, Python threading does not run in parallel unless you have non-Python work to do, so instead you have to use multiprocessing (I am assuming validate_email is a pure Python function).
How exactly you proceed depends on where the bottleneck in your program is and how much speed you need to reach the point where you are IO-bound (since you cannot actually go faster than the IO, you can stop optimizing once you get to that point).
The emails set object is hard to share because you would need to lock around it, so it is probably best to keep it in a single thread. Looking at the multiprocessing library, the easiest mechanism to use is Process Pools.
Using that, you would wrap your file iterable in itertools.ifilter, which discards duplicates, feed it to Pool.imap_unordered, and then iterate over the results and write them to your two output files.
Something like:
import itertools
from multiprocessing import Pool

with open(email_path) as f:
    for result in Pool().imap_unordered(validate_map, itertools.ifilter(unique, f)):
        (good, email) = result
        if good:
            good_emails.write(email)
        else:
            bad_emails.write(email)
The validate_map function should be simple:
from validate_email import validate_email

def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)
The unique function should look something like this:
seen_emails = set()

def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True
ETA: I just realized that validate_email is a library that actually contacts SMTP servers. Since it is not busy running Python code, you can use threads. The threading API is not as convenient as the multiprocessing one, but you can use multiprocessing.dummy to get a thread-based Pool.
If you were CPU-bound, you really should not have more threads/processes than cores, but since your bottleneck is network IO, you can benefit from many more threads/processes. Since processes are expensive, you want to switch to threads and then crank up the number running in parallel (although you should be polite and not DoS the servers you are connecting to).
Consider from multiprocessing.dummy import Pool as ThreadPool, and then call ThreadPool(processes=32).imap_unordered().
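Putting it together, a sketch of the threaded version, reusing validate_map and unique from above (good_emails and bad_emails are assumed to be already-open output files; on Python 3, itertools.ifilter becomes the built-in filter):

import itertools
from multiprocessing.dummy import Pool as ThreadPool

# Network-bound work, so far more workers than cores is fine.
pool = ThreadPool(processes=32)
with open(email_path) as f:
    for good, email in pool.imap_unordered(validate_map, itertools.ifilter(unique, f)):
        if good:
            good_emails.write(email)
        else:
            bad_emails.write(email)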