You are asking the wrong question
Looking at the validate_email package, your real problem is that you are not batching your work efficiently. You should do the MX lookup only once per domain, then connect to each MX server only once, go through the handshake, and check all the addresses for that server in one batch. Fortunately, the validate_email package caches MX results for you, but you still need to group the email addresses by server so that the requests to the server itself can be batched.
You would need to modify the validate_email package to implement batch processing, and then possibly dedicate a thread to each domain, using the actual threading library, not multiprocessing.
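As a rough sketch of that grouping step (batch_verify here is a hypothetical helper, standing in for whatever batched MX-lookup/SMTP check you would add to validate_email):

from collections import defaultdict

# Group addresses by domain so each MX server only has to be contacted once.
by_domain = defaultdict(list)
with open(email_path) as f:
    for line in f:
        email = line.strip()
        if email:
            by_domain[email.split('@', 1)[-1].lower()].append(email)

for domain, addresses in by_domain.items():
    # batch_verify is hypothetical: one MX lookup and one SMTP handshake,
    # then a RCPT TO check for every address on that server.
    results = batch_verify(domain, addresses)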
It is always important to profile your program when it is slow and find out where it actually spends its time, rather than blindly applying optimization tricks.
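For example, the standard-library profiler will show you where the time goes (process_emails is a placeholder name for whatever function drives your validation loop):

import cProfile
import pstats

# Run the workload under the profiler and show the 10 most expensive calls.
cProfile.run('process_emails()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)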
Requested solution
IO is effectively already asynchronous if you use buffered I/O and your usage pattern fits the OS buffering. The only place you could gain anything is read-ahead, but Python already does that when you use iterator access to the file (which you do). AsyncIO is a win for programs that move large amounts of data and disable OS buffering to avoid copying the data twice.
You need to actually profile/measure your program to see whether it has room for improvement. If your disks are not already saturated, there is a chance to improve performance by processing the email addresses in parallel. The easiest way to check this is probably to see whether the core running your program is maxed out (i.e. whether you are CPU-bound rather than IO-bound).
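One quick way to estimate that from inside the program (assuming Python 3, and the same hypothetical process_emails entry point as above) is to compare CPU time against wall-clock time; a ratio close to 1.0 means you are CPU-bound, a small ratio means you are mostly waiting on IO:

import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

process_emails()  # hypothetical driver for the validation loop

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print('CPU-bound fraction: %.2f' % (cpu_elapsed / wall_elapsed))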
If you are CPU-bound, you need to look at threading. Unfortunately, Python threading does not run in parallel unless you have non-Python work to do, so instead you have to use multiprocessing (I am assuming validate_email is a pure Python function).
How exactly you proceed depends on where the bottleneck in your program is and how much speed you need to reach the point where you are IO-bound (since you cannot actually go faster than the IO, you can stop optimizing once you get to that point).
The emails set object is hard to share because you would need to lock around it, so it is probably best to keep it in a single thread. Looking at the multiprocessing library, the easiest mechanism to use is Process Pools.
Using that, you would wrap your file iterable in itertools.ifilter, which discards duplicates, feed it to Pool.imap_unordered, and then iterate over the results and write them to your two output files.
Something like:
import itertools
from multiprocessing import Pool

with open(email_path) as f:
    for result in Pool().imap_unordered(validate_map, itertools.ifilter(unique, f)):
        (good, email) = result
        if good:
            good_emails.write(email)
        else:
            bad_emails.write(email)
The validate_map function should be simple:
from validate_email import validate_email

def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)
The unique function should look something like this:
seen_emails = set()

def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True
ETA: I just realized that validate_email is a library that actually contacts SMTP servers. Since it is not busy running Python code, you can use threads. The threading API is not as convenient as the multiprocessing one, but you can use multiprocessing.dummy to get a thread-based Pool.
If you were CPU-bound, you really should not have more threads/processes than cores, but since your bottleneck is network IO, you can benefit from many more threads/processes. Since processes are expensive, you want to switch to threads and then crank up the number running in parallel (although you should be polite and not DoS the servers you are connecting to).
Consider from multiprocessing.dummy import Pool as ThreadPool, and then call ThreadPool(processes=32).imap_unordered().
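Putting it together, a sketch of the threaded version, reusing validate_map and unique from above (good_emails and bad_emails are assumed to be already-open output files; on Python 3, itertools.ifilter becomes the built-in filter):

import itertools
from multiprocessing.dummy import Pool as ThreadPool

# Network-bound work, so far more workers than cores is fine.
pool = ThreadPool(processes=32)
with open(email_path) as f:
    for good, email in pool.imap_unordered(validate_map, itertools.ifilter(unique, f)):
        if good:
            good_emails.write(email)
        else:
            bad_emails.write(email)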