I have a lot of real-world text that I need to split into words for input to a spell checker. I would like to extract as many meaningful words as possible without too much noise. I know there are plenty of regex ninjas here, so hopefully someone can help me.
I am currently extracting all alphabetical sequences using '[a-z]+'. This is a decent approximation, but it drags in a lot of garbage.
Ideally, I would like a regular expression (it need not be pretty or efficient) that extracts all alphabetical sequences bounded by natural word delimiters (for example, [/-_,.: ], etc.) and ignores any alphabetical sequences with illegal boundaries.
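To make the delimiter idea concrete, here is a minimal sketch that does not rely on a single regex: split on runs of delimiters, then keep only the tokens made up entirely of letters. The delimiter set (the characters listed above plus whitespace) and the function name words_between_delimiters are assumptions for illustration, not a settled design.

```python
import re

# Assumed delimiter set: the examples from the question plus any whitespace.
DELIMITERS = re.compile(r"[/\-_,.:\s]+")

def words_between_delimiters(text):
    """Split on delimiter runs and keep tokens made only of letters a-z."""
    tokens = DELIMITERS.split(text.lower())
    # fullmatch rejects empty tokens and anything containing a non-letter
    return [t for t in tokens if re.fullmatch(r"[a-z]+", t)]

print(words_between_delimiters("http://foo.com"))
print(words_between_delimiters("pie21 is tasty"))
```

Under these assumptions a token such as 'pie21' is dropped as a whole, because it is judged only after splitting.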
However, I would be happy to just get all the alphabetical sequences that are NOT adjacent to a digit. So, for example, 'pie21' would NOT yield 'pie', but 'http://foo.com' would yield ['http', 'foo', 'com'].
I tried lookahead and lookbehind assertions, but they were applied per character (so, for example, re.findall('(?<!\d)[a-z]+(?!\d)', 'pie21') returns ['pi'] when I want it to return nothing). I tried wrapping the alpha part in a non-capturing group ((?:[a-z]+)), but that didn't help.
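The snippet below reproduces the problem and sketches one possible way around it: the engine backtracks from 'pie' to 'pi' because the following 'e' is not a digit, so the lookarounds could also forbid letters, which forces a match to span the whole alphabetic run. The names STRICT_WORD and extract_words are hypothetical, and the case-insensitive, ASCII-only flags are an assumption about what the data needs.

```python
import re

# Original attempt: backtracks to "pi", since the next character "e" is not a digit.
print(re.findall(r"(?<!\d)[a-z]+(?!\d)", "pie21"))

# Sketch: also forbid letters on both sides, so a match cannot be a fragment
# of a longer run. re.ASCII keeps IGNORECASE from matching non-ASCII letters.
STRICT_WORD = re.compile(r"(?<![a-z\d])[a-z]+(?![a-z\d])", re.IGNORECASE | re.ASCII)

def extract_words(text):
    """Return alphabetic runs that are not adjacent to a letter or a digit."""
    return STRICT_WORD.findall(text)

print(extract_words("pie21"))
print(extract_words("http://foo.com"))
```

With this pattern a run is accepted only if it starts and ends at a non-alphanumeric boundary, so any sequence touching a digit is skipped entirely.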
Details: the data is an email database, so it is mostly plain English with normal numbers, but occasionally there are garbage strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I would like to ignore completely. I am assuming that any alphabetical sequence with a digit in it is garbage.
orlade