I could find a few code snippets on Google by searching for "normalize email address", but nothing nearly thorough enough. I'm afraid you will have to write your own tool. If I were to write such a tool, here are a few rules I think I would apply:
First, the tool would lowercase the domain name (after the @). This should not be too complicated, unless you want to handle emails with international domain names. For example, JoE@caFÉ.fR (note the accent on the E) must first go through Nameprep and then be converted to its ASCII (Punycode) form. This leads to JoE@xn--caf-dma.fr. I have never seen anyone with such an international email address, but I suspect you might find some in China or Japan, for example.
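A rough sketch of this first step in Python, assuming the standard library's `idna` codec (which implements the older IDNA 2003 rules, including Nameprep) is good enough for your domains. The function names `split_address` and `normalize_domain` are mine, not from any library:

```python
def split_address(address: str) -> tuple[str, str]:
    """Split an address at the LAST '@' into (local part, domain)."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a usable email address: {address!r}")
    return local, domain


def normalize_domain(domain: str) -> str:
    """Return the lowercase ASCII (Punycode) form of a domain name."""
    # The "idna" codec runs Nameprep on non-ASCII labels only; pure-ASCII
    # labels pass through unchanged, so lowercase the result explicitly.
    ascii_domain = domain.encode("idna").decode("ascii")
    return ascii_domain.lower()


local, domain = split_address("JoE@caFÉ.fR")
print(local + "@" + normalize_domain(domain))  # JoE@xn--caf-dma.fr
```

Note that the local part is deliberately left alone here; only the domain is touched at this stage.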
The RFCs (RFC 5321 in particular) state that the local part of the address (before the @) is case sensitive, but the de facto standard for virtually all providers is to ignore case (I have never seen a case-sensitive email address actually used by a person, but I suppose there are still some system administrators out there who use Un*x email accounts where case does matter). I think the tool should have an option to ignore case only for a list of domain names (or, conversely, to be case sensitive only for a list of domain names). So, at this point, the email address JoE@caFÉ.fR is normalized to joe@xn--caf-dma.fr.
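The "case sensitive only for a list of domain names" variant could look something like this. It is only a sketch: the set `CASE_SENSITIVE_DOMAINS` and its single entry are made up for illustration, and the function expects a domain that has already been normalized to lowercase ASCII:

```python
# Hypothetical configuration: domains whose mailboxes really are case sensitive.
CASE_SENSITIVE_DOMAINS = frozenset({"unix.example.org"})


def normalize_local_case(local: str, domain: str,
                         case_sensitive_domains=CASE_SENSITIVE_DOMAINS) -> str:
    """Lowercase the local part unless the domain is listed as case sensitive."""
    if domain in case_sensitive_domains:
        return local  # keep the local part exactly as written
    return local.lower()


print(normalize_local_case("JoE", "xn--caf-dma.fr"))    # joe
print(normalize_local_case("JoE", "unix.example.org"))  # JoE
```

Passing the set as a parameter keeps the policy configurable, so the opposite default (case sensitive unless listed) is a small change.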
The next question is about international (non-ASCII) email addresses. What if the local part is not ASCII? For example, something like 甲斐@黒川.日本 (disclaimer: I do not speak Japanese). RFC 5322 prohibits this, but more recent RFCs support it (see this Wikipedia article). Many languages have no notion of lowercase or uppercase. When they do, if you want to convert to lowercase, be sure to use the appropriate Unicode lowercasing algorithm, which is not always trivial. For example, in German, the lowercase form of the word "Großes" may be either "großes" or "grosses" (disclaimer: I do not speak German either). So, at this point, the email address "Großes@caFÉ.Fr" should be normalized to "grosses@xn--caf-dma.fr".
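In Python the difference between the two German forms shows up as `str.lower()` versus `str.casefold()`. A small sketch, assuming full Unicode case folding followed by NFC normalization is what you want for comparison purposes (the function name `fold_local_part` is hypothetical):

```python
import unicodedata


def fold_local_part(local: str) -> str:
    """Case-fold a (possibly non-ASCII) local part for comparison."""
    # casefold() applies full Unicode case folding (e.g. "ß" becomes "ss");
    # NFC afterwards makes composed and decomposed accents compare equal.
    return unicodedata.normalize("NFC", local.casefold())


print("Großes".lower())           # großes
print(fold_local_part("Großes"))  # grosses
print(fold_local_part("甲斐"))     # 甲斐 (no case distinction, left unchanged)
```

Whether a given mail provider actually treats "ß" and "ss" as the same mailbox is a separate question; this only gives you a consistent key to compare on.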
I have not read RFC 5322 in detail, but I think it is possible to have comments in an email address, either at the beginning or at the end of the local part, for example (sir)john.lennon@beatles.com or john.lennon(ono)@beatles.com. These comments must be stripped (this would lead to john.lennon@beatles.com). Stripping comments is not entirely trivial, because I am not sure how nested comments should be handled, and because comments inside double quotes must not be stripped according to the RFC (if I am not mistaken). For example, the comment in the following email address must not be removed, according to the RFC: "john.(ono).lennon"@beatles.com.
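Here is a sketch of comment stripping as a small character scanner. It makes a few assumptions that you should check against the RFC: nested comments are dropped together with the outer one (the RFC 5322 grammar does let comments nest), anything inside double quotes is kept verbatim, a backslash escapes the next character, and whitespace left around the local part is trimmed. `strip_comments` is a name I made up:

```python
def strip_comments(local: str) -> str:
    """Remove parenthesised comments that appear outside double quotes."""
    out = []
    depth = 0          # current nesting level of comment parentheses
    in_quotes = False  # True while inside a double-quoted string
    escaped = False    # True if the previous character was a backslash
    for ch in local:
        if escaped:
            # An escaped character is literal; keep it unless inside a comment.
            if depth == 0:
                out.append(ch)
            escaped = False
        elif ch == "\\":
            if depth == 0:
                out.append(ch)
            escaped = True
        elif in_quotes:
            out.append(ch)
            if ch == '"':
                in_quotes = False
        elif depth > 0:
            # Inside a comment: only track nesting, emit nothing.
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
        elif ch == "(":
            depth = 1
        else:
            if ch == '"':
                in_quotes = True
            out.append(ch)
    if depth or in_quotes:
        raise ValueError("unbalanced comment or quoted string")
    return "".join(out).strip()


print(strip_comments("(sir)john.lennon"))       # john.lennon
print(strip_comments("john.lennon(ono)"))       # john.lennon
print(strip_comments('"john.(ono).lennon"'))    # "john.(ono).lennon"
```

The function works on the local part only, so split the address at the last @ before calling it.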
Once the email is normalized this way, I would apply the "provider specific" rules that you suggested. For example, removing the dots in Gmail addresses and merging equivalent domain names (googlemail.com == gmail.com, for example). I think I would keep this really separate from the previous normalization steps.
Note that Gmail also ignores the plus sign (+) and everything after it in the local part; for example, smith+hello_world@gmail.com is equivalent to smith@gmail.com.
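The Gmail rules mentioned here could be kept in their own function, separate from the generic steps. This is only a sketch of the three rules named above (domain alias, plus suffix, dots); the table `DOMAIN_ALIASES` and the function name are assumptions of mine, and the inputs are expected to be already lowercased:

```python
# Assumed alias table: maps equivalent domains to one canonical domain.
DOMAIN_ALIASES = {"googlemail.com": "gmail.com"}


def apply_provider_rules(local: str, domain: str) -> tuple[str, str]:
    """Apply provider-specific rules to an already normalized address."""
    domain = DOMAIN_ALIASES.get(domain, domain)
    if domain == "gmail.com":
        local = local.split("+", 1)[0]  # drop "+" and everything after it
        local = local.replace(".", "")  # dots are not significant
    return local, domain


print(apply_provider_rules("smith+hello_world", "gmail.com"))
# ('smith', 'gmail.com')
print(apply_provider_rules("john.smith", "googlemail.com"))
# ('johnsmith', 'gmail.com')
print(apply_provider_rules("john.smith", "example.org"))
# ('john.smith', 'example.org')
```

Keeping the rules in a lookup table plus one function per provider should make them easier to update when a provider changes its behaviour.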
I am not aware of any other provider-specific rules. The thing is, these rules can change at any time, so you will have to keep track of all of them.
That is about all I can think of. If you come up with working code, I would be very interested to see it.
Cheers!