Extracting whole words

I have a large body of real-world typed text that I need to pull words out of for input into a spell checker. I would like to extract as many meaningful words as possible without too much noise. I know there are plenty of regex ninjas around here, so hopefully someone can help me.

I am currently retrieving all alphabetical sequences using '[a-z]+'. This is a decent approximation, but it drags in a lot of garbage.

Ideally, I would like a regular expression (it does not have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (for example, [/-_,.: ] , etc.) and ignores any alphabetical sequences with illegitimate boundaries.

However, I would be happy to just get all the alphabetical sequences that are NOT adjacent to a digit. So, for example, 'pie21' would NOT yield 'pie', but 'http://foo.com' would yield ['http', 'foo', 'com'].

I tried lookahead and lookbehind assertions, but they were applied per character (so, for example, re.findall('(?<!\d)[a-z]+(?!\d)', 'pie21') returns 'pi' when I want it to return nothing). I tried wrapping the alpha part as a group ( (?:[a-z]+) ), but that didn't help.

Details: The data is an email database, so it is mostly plain English with normal numbers, but occasionally there are garbage strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I would like to ignore completely. I am assuming that any alphabetical sequence with a digit in it is garbage.

+8
4 answers

If you are restricting yourself to ASCII letters, use (with the re.I option):

 \b[a-z]+\b 

\b is the word-boundary anchor; it matches only at the start and end of alphanumeric words. Therefore, \b[a-z]+\b matches pie, but not pie21 or 21pie.
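As a quick check of the behaviour described above, here is a minimal sketch in Python 3. The names WORD_RE and extract_words are made up for illustration, and the sample string is assembled from the examples in the question.

```python
import re

# ASCII letters only, case-insensitive, anchored on word boundaries.
WORD_RE = re.compile(r"\b[a-z]+\b", re.I)

def extract_words(text):
    """Return alphabetic runs that are not glued to digits or underscores."""
    return WORD_RE.findall(text)

print(extract_words("pie21 21pie http://foo.com AC7A21C0 plain pie"))
# ['http', 'foo', 'com', 'plain', 'pie']
```

One thing to keep in mind: \b treats the underscore as a word character, so with this pattern 'foo_bar' yields nothing rather than 'foo' and 'bar'.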

To also allow non-ASCII letters, you can use something like this:

 \b[^\W\d_]+\b 

which also matches accented characters, etc. You may need to set the re.UNICODE option, especially in Python 2, so that the \w shorthand matches non-ASCII letters.

[^\W\d_], as a negated character class, matches any alphanumeric character except digits and underscores.
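For illustration, a small sketch of the Unicode-aware variant, assuming Python 3, where str patterns are Unicode-aware by default and no flag is needed. The names LETTERS_RE and extract_unicode_words and the sample words are hypothetical.

```python
import re

# [^\W\d_] = "word character that is neither a digit nor an underscore",
# i.e. a letter, including non-ASCII letters for str patterns in Python 3.
LETTERS_RE = re.compile(r"\b[^\W\d_]+\b")

def extract_unicode_words(text):
    """Return runs of letters bounded by word boundaries."""
    return LETTERS_RE.findall(text)

print(extract_unicode_words("café naïve pie21 Straße x_y"))
# ['café', 'naïve', 'Straße']
```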

+16

Are you familiar with word boundaries ( \b )? You can extract words by putting \b around the sequence and matching letters inside:

 \b([a-zA-Z]+)\b 

For example, this captures whole words but stops at tokens such as hyphens, periods, semicolons, etc.

You can read about \b and others in the Python manual.

EDIT: Also, if you want to check for a digit that follows or precedes a match, you can use a negative lookahead/lookbehind:

 (?!\d)   # negative look-ahead for numbers
 (?<!\d)  # negative look-behind for numbers
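Used on their own, these assertions let the engine backtrack to a shorter match, which is the 'pi' result mentioned in the question. One possible workaround, sketched here as an assumption rather than something this answer spells out, is to forbid letters as well as digits in both lookarounds so the match cannot shrink. The names NAIVE_RE, STRICT_RE and extract_strict are hypothetical.

```python
import re

# Digits only in the lookarounds: the engine can give back the final letter.
NAIVE_RE = re.compile(r"(?<!\d)[a-z]+(?!\d)", re.I)

# Letters and digits in the lookarounds: the run must be maximal and
# must not touch a digit on either side.
STRICT_RE = re.compile(r"(?<![a-z\d])[a-z]+(?![a-z\d])", re.I)

def extract_strict(text):
    """Return letter runs with no letter or digit directly before or after."""
    return STRICT_RE.findall(text)

print(NAIVE_RE.findall("pie21"))         # ['pi']
print(extract_strict("pie21"))           # []
print(extract_strict("http://foo.com"))  # ['http', 'foo', 'com']
print(extract_strict("foo_bar"))         # ['foo', 'bar']
```

Unlike \b, this version treats the underscore as a delimiter, which is closer to the delimiter list given in the question.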
+3

What about:

 import re

 yourString = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA pie42"
 # Split on delimiters, drop duplicates with set(), keep purely alphabetic tokens.
 words = list(filter(lambda x: re.match(r"^[a-zA-Z]+$", x),
                     set(re.split(r"[\s:/,.]", yourString))))

Note that:

  • split breaks your string into potential candidates and returns a list of "potential words"
  • set removes duplicates: it converts the list into a set, dropping entries that appear more than once. This step is optional.
  • filter reduces the number of candidates: it takes a sequence, applies a test function to each element, and returns the elements that pass the test. In our case, the test function is anonymous
  • lambda: an anonymous function that takes an element and checks whether it is a word (upper- or lower-case letters only)
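The same steps can be sketched as a small Python 3 function. This is an illustrative variant, not the answer's exact code: the name candidate_words is made up, the delimiter class is the one used above, and the result is sorted only so that the output is deterministic.

```python
import re

def candidate_words(text):
    """Split on delimiters, deduplicate, keep tokens made only of letters."""
    tokens = re.split(r"[\s:/,.]+", text)  # potential words
    return sorted({t for t in tokens if re.fullmatch(r"[a-zA-Z]+", t)})

sample = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA pie42"
print(candidate_words(sample))
# ['com', 'foo', 'http', 'pie']
```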

EDIT: some clarification added

+2

Code example

 print re.search(ur'(?u)\b', ur'')
 print re.search(ur'(?u)\b\b', ur'')

or

 import re

 s = ur"abcd "
 rx1 = re.compile(ur"(?u)")
 rx2 = re.compile(ur"(?u)\b")
 rx3 = re.compile(ur"(?u)\b\b")
 print rx1.findall(s)
 print rx2.findall(s)
 print rx3.findall(s)
0