Library for canonicalizing (normalizing, not just cleaning) email addresses

There are several ways to write email address strings that differ under direct string comparison (see below) but are logically equivalent (i.e. mail sent to either one ends up in the same mailbox). This often lets users supply seemingly unique email addresses even when exact duplicates are prohibited.

I was hoping to find a library that tries to normalize addresses, so that I can find duplicates in large sets of email addresses. The goal is to find as many duplicates as possible. Given how useful this is for several purposes (in my case, simple abuse detection, because abusers tend to (try to) just reuse certain accounts), I would expect existing solutions to exist.

So what things can vary? I know at least things like:

  • the domain part is case-insensitive (per DNS); the local part may or may not be, depending on the mail provider (for example, Gmail treats it as case-insensitive)
  • many domains have aliases (googlemail.com is equivalent to gmail.com)
  • some email providers allow variations that they then ignore (e.g. Gmail ignores any dots in the local part!)

Ideally this would be in Java, although a scripting language would also work (as a command-line tool).

+8
java email email-validation normalization
2 answers

I can find a few code snippets on Google by searching for "normalize email address", but nothing sufficient. I'm afraid you will have to write your own tool. If I wrote such a tool, here are a few rules I think I would apply:

First, the tool would lowercase the domain name (after the @). This should not be too complicated as long as you do not need to handle internationalized domain names. For example, JoE@caFÉ.fR (note the accent on the E) must first go through Nameprep, which yields JoE@xn--caf-dma.fr. I have never seen anyone with such an internationalized email address, but I suspect you might find some in China or Japan, for example.
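A minimal sketch of this domain step, assuming the JDK's java.net.IDN class (which implements the older IDNA 2003 / Nameprep rules) is good enough for your data. The class and method names are hypothetical. Note that IDN.toASCII leaves all-ASCII labels such as "fR" untouched, so an explicit lowercase pass is still needed afterwards.

```java
import java.net.IDN;
import java.util.Locale;

// Hypothetical helper, not part of any existing library.
final class DomainNormalizer {

    // Converts a (possibly internationalized) domain to lowercase ASCII form.
    static String normalizeDomain(String domain) {
        // Nameprep + Punycode for non-ASCII labels; throws
        // IllegalArgumentException if the input is not a valid IDN.
        String ascii = IDN.toASCII(domain);
        // ASCII-only labels are not case-folded by toASCII, so lowercase them.
        return ascii.toLowerCase(Locale.ROOT);
    }

    // Applies normalizeDomain to the part after the last '@'.
    static String normalizeDomainOf(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            return address; // not an address we understand; leave untouched
        }
        return address.substring(0, at + 1) + normalizeDomain(address.substring(at + 1));
    }
}
```

Calling DomainNormalizer.normalizeDomainOf on the example above keeps the local part "JoE" as it is and rewrites only the domain.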

RFC 5322 states that the local part of the address (before the @) is case-sensitive, but the de facto standard, for almost all providers, is to ignore case (I have never seen a case-sensitive email address actually used by a human being, but I suppose there are still some system administrators who use their Un*x email accounts, where case matters). I think the tool should have an option to ignore case for a list of domain names (or, conversely, to be case-sensitive only for a list of domain names). So, at this point, the email address JoE@caFÉ.fR is normalized to joe@xn--caf-dma.fr.
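The "case-sensitive only for a list of domains" option could look roughly like this. It is only a sketch: the class name, the method name, and the idea of passing the exceptions as a Set are all assumptions, and the domain is expected to be in ASCII form already.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: lowercases the local part unless the domain is
// listed as case-sensitive.
final class LocalPartCase {

    static String normalize(String address, Set<String> caseSensitiveDomains) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            return address; // no '@': leave untouched
        }
        String local = address.substring(0, at);
        // The domain is always case-insensitive.
        String domain = address.substring(at + 1).toLowerCase(Locale.ROOT);
        if (!caseSensitiveDomains.contains(domain)) {
            local = local.toLowerCase(Locale.ROOT);
        }
        return local + "@" + domain;
    }
}
```

Entries in caseSensitiveDomains should themselves be lowercase, since the lookup happens after the domain has been lowercased.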

Again, there is the question of international (non-ASCII) email addresses. What if the local part is not ASCII? For example, something like 甲斐@黒川.日本 (disclaimer: I do not speak Japanese). RFC 5322 prohibits this, but more recent RFCs allow it (see the Wikipedia article on email addresses). Many languages have no concept of lower or upper case. When they do, and you want to convert to lowercase, be sure to use the appropriate Unicode lowercasing algorithms, which is not always trivial. For example, in German, the lowercase form of "Großes" may be written "großes" or "grosses" (disclaimer: I do not speak German either). So, at this point, the email address "Großes@caFÉ.Fr" should be normalized to "grosses@xn--caf-dma.fr".
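As far as I know the JDK has no dedicated Unicode case-folding method, so the sketch below uses a common approximation: uppercase first, then lowercase. Java's String.toUpperCase applies the special-casing rule that turns "ß" into "SS", which is what makes "Großes" and "Grosses" collapse to the same string. The class and method names are hypothetical, and this is not full Unicode case folding; a dedicated Unicode library may be a better fit for real data.

```java
import java.util.Locale;

// Hypothetical helper approximating case folding with JDK-only calls.
final class CaseFold {

    static String approximateFold(String s) {
        // Locale.ROOT avoids locale-specific surprises such as the
        // Turkish dotted/dotless i.
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }
}
```

A plain toLowerCase would leave "ß" in place, so "Großes" and "Grosses" would still compare as different.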

I have not read RFC 5322 in detail, but I think it is possible to have comments in an email address, at either the beginning or the end of the local part, for example (sir)john.lennon@beatles.com or john.lennon(ono)@beatles.com. These comments must be removed (resulting in john.lennon@beatles.com). Removing comments is not trivial: I do not know how nested comments should be handled, and text in parentheses inside double quotes is not a comment and must not be removed, according to the RFC (if I am not mistaken). For example, the parenthesized text in the following address should not be removed: "john.(ono).lennon"@beatles.com.
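A rough sketch of comment removal with a hand-written scanner. It makes several assumptions that are mine rather than the RFC's exact grammar: nested comments are dropped as a whole, a backslash escapes the next character, and anything inside double quotes is left alone. The class name is hypothetical.

```java
// Hypothetical helper: strips parenthesized comments that appear outside
// quoted strings. A simplification, not a full RFC 5322 parser.
final class CommentStripper {

    static String stripComments(String address) {
        StringBuilder out = new StringBuilder();
        boolean inQuotes = false;
        int depth = 0; // current comment nesting level
        for (int i = 0; i < address.length(); i++) {
            char c = address.charAt(i);
            if (c == '\\' && i + 1 < address.length()) {
                // Escaped pair: keep it only when outside a comment.
                if (depth == 0) {
                    out.append(c).append(address.charAt(i + 1));
                }
                i++;
            } else if (depth == 0 && c == '"') {
                inQuotes = !inQuotes;
                out.append(c);
            } else if (!inQuotes && c == '(') {
                depth++;
            } else if (!inQuotes && c == ')' && depth > 0) {
                depth--;
            } else if (depth == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Both (sir)john.lennon@beatles.com and john.lennon(ono)@beatles.com reduce to john.lennon@beatles.com, while the quoted "john.(ono).lennon"@beatles.com is returned unchanged.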

Once the email address is normalized this way, I would apply the "provider-specific" rules that you suggest. For example, remove dots in Gmail addresses and merge equivalent domain names (e.g. googlemail.com == gmail.com). I think I would keep this step strictly separate from the previous normalization steps.

Please note that Gmail also ignores the plus sign (+) and everything after it, for example smith+hello_world@gmail.com is equivalent to smith@gmail.com.
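The Gmail rules mentioned here (dots, the plus tag, and the googlemail.com alias) could be kept in a separate provider-specific step, roughly as follows. The class name and the alias map are hypothetical, and only the rules named above are encoded; other providers would need their own entries.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical provider-specific step, applied after generic normalization.
final class GmailRules {

    private static final Map<String, String> DOMAIN_ALIASES = new HashMap<>();
    static {
        DOMAIN_ALIASES.put("googlemail.com", "gmail.com");
    }

    static String canonicalize(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            return address; // no '@': leave untouched
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1).toLowerCase(Locale.ROOT);
        domain = DOMAIN_ALIASES.getOrDefault(domain, domain);
        if (domain.equals("gmail.com")) {
            int plus = local.indexOf('+');
            if (plus >= 0) {
                local = local.substring(0, plus); // drop "+tag"
            }
            // Gmail ignores dots and case in the local part.
            local = local.replace(".", "").toLowerCase(Locale.ROOT);
        }
        return local + "@" + domain;
    }
}
```

Addresses on other domains keep their dots and plus tags, since those may be significant there.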

I do not know of any other provider-specific rules. The trouble is that these rules can change at any time, so you would have to keep track of them all.

That is all I can think of. If you come up with some working code, I would be very interested to see it.

Cheers!

+16

I am using Apache James Mime4J to parse email addresses.

  • It handles (comments) correctly and removes them from localPart and domainPart.

  • It correctly handles "dotted and quoted" and +tagged local parts.

  • It has getLocalPart() and getDomainPart() methods.

  • It does not normalize Gmail localParts.

+4